Software Tool Article

autoHGPEC: Automated prediction of novel disease-gene and disease-disease associations and evidence collection based on a random walk on heterogeneous network

[version 1; peer review: 2 approved with reservations]
PUBLISHED 24 May 2018

This article is included in the Cytoscape gateway.

Abstract

Identification of novel disease-gene and disease-disease associations is an important task in biomedical research. Recently, we developed a Cytoscape app, HGPEC, which uses a state-of-the-art network-based method for this task. This paper describes an upgraded version of HGPEC, named autoHGPEC, with added automation features. With these features, autoHGPEC can be used as a component of complex analysis pipelines and can make use of other data resources. We demonstrate the use of autoHGPEC by predicting novel breast cancer-associated genes and diseases, and then by visualizing and collecting evidence for the associations between the top 20 ranked genes/diseases and breast cancer.

Keywords

Cytoscape app, Automation features, CyREST, R, Disease-gene association, Disease-disease association, Random walk with restart algorithm, Heterogeneous network, Gene prioritization, Disease prioritization

Introduction

One of the challenging tasks in biomedicine is to prioritize candidate genes and diseases by their degree of relevance to a disease of interest. This is the starting point for identifying novel disease-gene and disease-disease associations. A large number of computational methods, including network-based and machine learning-based ones, have been proposed for this task [1,2]. State-of-the-art network-based methods often integrate diseases and genes into a heterogeneous network and then apply a propagation algorithm that exploits the similarity between diseases/genes and known disease-gene associations to predict novel associations [3-7]. Some tools have also been developed to facilitate the use of these methods. However, most of them focus only on predicting novel disease-gene associations [8-10], including some tools developed as apps for Cytoscape [11]. Recently, we developed a Cytoscape app, HGPEC [12], to predict both disease-gene and disease-disease associations based on a state-of-the-art method on a heterogeneous network of diseases and genes [3]. HGPEC was shown to outperform two other network-based Cytoscape apps for predicting novel disease-gene associations, GPEC [13] and PRINCIPLE [14], in terms of prediction performance [12]. In addition, HGPEC can prioritize candidate genes of diseases without a known molecular basis and collect evidence to support novel predictions from various data resources such as Gene Ontology [15], Disease Ontology [16], KEGG pathway [17], GeneRIF [18], PubMed [19], protein complexes [20] and OMIM [21]. As a Cytoscape app, HGPEC can exploit advanced features of Cytoscape such as data visualization and integration. However, Cytoscape is a desktop-based tool, so HGPEC cannot be linked flexibly to other analysis environments such as R and Python. This limits the use of HGPEC, because it cannot be run automatically as a component of a complex analysis pipeline in these environments. It also prevents Cytoscape from integrating data from other data resources. Recently, automation features have been added to Cytoscape to facilitate these tasks.

In this study, we upgraded HGPEC by adding automation features and named the new app autoHGPEC. autoHGPEC has the same functions as HGPEC; however, these functions can be called through both CyREST functions and commands, and thus from external environments. To use autoHGPEC, a heterogeneous network of diseases and genes, composed of a disease similarity network, a gene/protein network and known disease-gene associations, must be provided. Then, a disease of interest is selected from the disease similarity network. The disease and its known associated genes (if any) are used as training/seed data. A set of candidate genes is then defined by selecting genes from the gene network or from a chromosome. These candidate genes and all remaining diseases are ranked by an RWRH-based method (see the Methods section). Finally, users can select top-ranked genes/diseases for further analyses such as visualization and evidence collection. We show the ability of autoHGPEC to predict novel genes and diseases associated with breast cancer.

Methods

RWRH-based method

autoHGPEC was implemented using a ranking algorithm, random walk with restart on a heterogeneous network (RWRH) [12]. Briefly, this network-based algorithm propagates the disease information embedded in a disease of interest and its known associated genes (the seed/training nodes) to other diseases and genes in the heterogeneous network. The propagation is performed by a random walk starting from the seed nodes. At each node, the random walker either moves to an adjacent node or returns to the seed nodes with a fixed probability (the back-probability). This process is repeated until a steady state is reached. The score assigned to each node at this state represents its degree of relevance to the seed nodes, and thus to the disease of interest. Finally, candidate genes and diseases are ranked by their scores, and top-ranked candidates can be selected as promising genes and diseases for further investigation.
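
The propagation described above can be illustrated with a minimal Python sketch. This is not the autoHGPEC implementation: it is a plain random walk with restart on one small, made-up weighted graph, and it leaves out the jumping probability and subnetwork importance weight that the full RWRH method uses to couple the disease and gene networks. The function name, the toy node names and the convergence tolerance are all assumptions made for illustration.

```python
def random_walk_with_restart(adjacency, seeds, back_prob=0.5, tol=1e-10, max_iter=1000):
    """Return the steady-state visiting probability of every node.

    adjacency: dict mapping node -> {neighbour: edge weight} (undirected graph).
    seeds:     set of nodes the walker restarts from.
    back_prob: probability of returning to the seeds at each step.
    Assumes every node has at least one neighbour.
    """
    nodes = sorted(adjacency)
    # Restart distribution: uniform over the seed nodes, zero elsewhere.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    # Total edge weight per node, used to normalise transition probabilities.
    out_weight = {n: sum(adjacency[n].values()) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart term, then add the mass that moves along edges.
        nxt = {n: back_prob * p0[n] for n in nodes}
        for u in nodes:
            for v, w in adjacency[u].items():
                nxt[v] += (1.0 - back_prob) * p[u] * w / out_weight[u]
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:  # steady state reached
            break
    return p


# Hypothetical toy network: one disease linked to one known gene (geneA),
# which in turn interacts with candidate genes.
toy = {
    "disease": {"geneA": 1.0},
    "geneA": {"disease": 1.0, "geneB": 1.0, "geneC": 1.0},
    "geneB": {"geneA": 1.0, "geneC": 1.0},
    "geneC": {"geneA": 1.0, "geneB": 1.0, "geneD": 1.0},
    "geneD": {"geneC": 1.0},
}
seed_nodes = {"disease", "geneA"}
scores = random_walk_with_restart(toy, seed_nodes, back_prob=0.5)
# Rank the non-seed nodes by descending score.
ranking = sorted((n for n in toy if n not in seed_nodes), key=lambda n: -scores[n])
```

In this toy graph, geneD is reachable from the seeds only through geneC, so it receives the lowest score among the candidates; genes directly connected to the seed gene rank higher.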

Implementation

autoHGPEC is an upgraded version of HGPEC [12] with added automation features. The main functions of HGPEC, namely prioritization, visualization and evidence collection, were therefore kept. In addition, as in HGPEC, a number of databases are preinstalled in autoHGPEC to facilitate its use. These include disease similarity networks, gene/protein networks and known disease-gene associations, as well as annotation data such as Gene Ontology [15], Disease Ontology [16], KEGG pathways [17], GeneRIF [18], and protein complexes [20]. Users can also supply other networks themselves. To provide automation features, we first refactored the HGPEC source code to use Cytoscape Tunable annotations, replacing the control panels in the west panel of HGPEC with a menu system. All functions of HGPEC are therefore accessed through the menu system. In addition, the workflow of HGPEC is exposed to users through the CyREST Command API (which can be explored in Swagger UI under the menu autoHGPEC). Corresponding CyREST API functions were developed as well. Thus, the result of each step in the workflow can be passed to the caller in JSON format for further analysis in R or Python.

Operation

autoHGPEC is designed to predict novel disease-gene and disease-disease associations and to collect evidence for them, based on a random walk on a heterogeneous network, with added automation features. It therefore follows the same workflow as HGPEC [12]. However, in addition to being run in desktop-based Cytoscape through the menu system, its functions can be called using the CyREST Command API and from other analysis tools such as R. Figure 1 shows the workflow of autoHGPEC in three running environments (see the user manual in Supplementary File 1). As a Cytoscape app with automation features, autoHGPEC can be run on any computer that satisfies the minimal requirements for running Cytoscape.


Figure 1. Workflow of autoHGPEC in three environments (i.e., Menu-based Cytoscape, CyREST Command API and R).

Use cases

To demonstrate the functions of autoHGPEC with automation features, we used it to predict novel genes and diseases associated with breast cancer (OMIM ID: 114480). Here, we briefly describe this case study by following the 5-step workflow in Figure 1 (see the user manual in Supplementary File 1 for more detail):

  - First, a heterogeneous network of genes and diseases was constructed by connecting a preinstalled disease similarity network (i.e., Disease_Similarity_Network_5) including 5,080 diseases and 19,729 interactions, a preinstalled human protein interaction network (i.e., Default_Human_PPI_Network) including 10,486 genes and 50,791 interactions, and known disease-gene associations collected from OMIM [21]. This step can be accomplished with the following command from within R:

    > commandRun('autoHGPEC step1_construct_network DiseaseGene="Disease-gene from OMIM" diseaseNetwork="Disease_Similarity_Network_5" geneNetwork="Default_Human_PPI_Network"')

  - Second, breast cancer (OMIM ID: 114480) was selected for investigation. This disease is known to be associated with 21 genes, which are also available in the human protein interaction network. The training set was then built from these genes and the disease of interest. The following two commands can be run within R for this task:

    > commandRun('autoHGPEC step2_1_select_disease diseaseName="breast cancer"')

    > commandRun('autoHGPEC step2_2_create_training_list diseaseTraining="MIM114480"')

  - Third, we selected all 10,465 remaining genes in the protein interaction network as candidate genes. This can be done with the following command:

    > commandRun('autoHGPEC step3_PCG_allRemaining')

  - Fourth, all genes and diseases in the heterogeneous network were ranked by applying the RWRH-based method with the back-probability, jumping probability and subnetwork importance weight set to 0.5, 0.6 and 0.7, respectively. The following command can be used to accomplish this task:

    > commandRun('autoHGPEC step4_prioritize backProb=0.5 jumpProb=0.6 subnetWeight=0.7')

  - Finally, we visualized and collected evidence for the associations between the 20 most highly ranked candidate genes/diseases and breast cancer. Users must first highlight the diseases and genes of interest in the corresponding network. These tasks can be performed using the following two commands, respectively:

    > commandRun('autoHGPEC step5_2_visualize')

    > commandRun('autoHGPEC step5_1_search_evidences')
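
For readers who prefer to drive the same workflow from Python, the following sketch assembles the command strings listed above and derives a request URL for each one. The command names and arguments are copied from the steps above; the helper functions are hypothetical, and the URL layout (`/v1/commands/<namespace>/<command>` on the default port 1234, with arguments as query parameters) is an assumption about how CyREST exposes app commands that should be checked in Swagger UI before use. Nothing is sent to Cytoscape here.

```python
from urllib.parse import quote, urlencode

# Assumed CyREST base URL (default port); adjust for your installation.
CYREST_BASE = "http://localhost:1234/v1"

# The case-study workflow as (command, arguments) pairs, in execution order.
WORKFLOW = [
    ("step1_construct_network", {
        "DiseaseGene": "Disease-gene from OMIM",
        "diseaseNetwork": "Disease_Similarity_Network_5",
        "geneNetwork": "Default_Human_PPI_Network",
    }),
    ("step2_1_select_disease", {"diseaseName": "breast cancer"}),
    ("step2_2_create_training_list", {"diseaseTraining": "MIM114480"}),
    ("step3_PCG_allRemaining", {}),
    ("step4_prioritize", {"backProb": 0.5, "jumpProb": 0.6, "subnetWeight": 0.7}),
    ("step5_2_visualize", {}),
    ("step5_1_search_evidences", {}),
]


def command_string(command, args, namespace="autoHGPEC"):
    """Build the command-line form used in the R examples (strings are quoted)."""
    parts = [namespace, command]
    for key, value in args.items():
        if isinstance(value, str):
            parts.append(f'{key}="{value}"')
        else:
            parts.append(f"{key}={value}")
    return " ".join(parts)


def command_url(command, args, namespace="autoHGPEC", base=CYREST_BASE):
    """Build a request URL for the command (assumed CyREST URL layout)."""
    url = f"{base}/commands/{quote(namespace)}/{quote(command)}"
    if args:
        # quote_via=quote encodes spaces as %20 rather than '+'.
        url += "?" + urlencode(args, quote_via=quote)
    return url


commands = [command_string(name, args) for name, args in WORKFLOW]
urls = [command_url(name, args) for name, args in WORKFLOW]
```

Keeping the workflow as data makes it easy to change a parameter, such as the back-probability in step 4, and regenerate every command consistently.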

Visualization results (Figure 2a and b) show that most of the top-ranked candidate genes are directly connected to known breast cancer-associated genes. In addition, highly ranked candidate diseases are directly connected to either known/training genes or the disease of interest. For evidence collection, we annotated and searched for evidence of promising associations between the top-ranked candidate genes/diseases and breast cancer. The results showed that each of the promising associations is supported by at least two data sources. More detail on interpreting the visualization and evidence collection results for these associations can be found in the HGPEC study [12].

Besides the fact that almost all autoHGPEC commands return results in JSON format, the results of autoHGPEC are also exposed via the CyREST API (menu Help/Automation/CyREST API). For example, the R command commandRun('autoHGPEC step2_1_select_disease diseaseName="breast cancer"') in Step 2 can be performed directly through the CyREST API with the request URL http://localhost:1234/autohgpec/v1/selectDisease/breast%20cancer (this URL is available after the heterogeneous network has been successfully constructed in Step 1). It returns a list of OMIM IDs associated with “breast cancer” in JSON format as follows:

[
  {
    "name": "BREAST CANCER 1 GENE; BRCA1",
    "DiseaseID": "MIM113705",
    "MedGenCUI": ""
  },
  {
    "name": "BREAST CANCER",
    "DiseaseID": "MIM114480",
    "MedGenCUI": "C0346153",
    "AssociatedGenes": "5888, 3845, 83990, 8493, 580, 841, 3161, 7517, 9821, 79728, 5245, 5002, 672, 675, 5290, 11200, 207, 472, 4835, 999, 7157, 8438"
  },
  {
    "name": "BREAST-OVARIAN CANCER, FAMILIAL, SUSCEPTIBILITY TO, 1; BROVCA1",
    "DiseaseID": "MIM604370",
    "MedGenCUI": "C2676676",
    "AssociatedGenes": "4978, 2956, 672, 5290, 207, 5071"
  },
  {
    "name": "BREAST CANCER ANTIESTROGEN RESISTANCE 3; BCAR3",
    "DiseaseID": "MIM604704",
    "MedGenCUI": ""
  }
]

Users can therefore call this CyREST API and use the result in their own workflows as needed.
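
As a rough illustration of consuming such a response in Python, the sketch below parses an abridged copy of the JSON shown above, looks up the record for the disease of interest and splits its comma-separated `AssociatedGenes` field into a list of gene IDs. The helper names are hypothetical. The response text is embedded as a string so that the example runs without Cytoscape; in a live session it would instead be read from the request URL given above, for instance with `urllib.request.urlopen`.

```python
import json

# Abridged copy of the response shown above (two of the four records).
response_text = """
[
  {"name": "BREAST CANCER 1 GENE; BRCA1",
   "DiseaseID": "MIM113705",
   "MedGenCUI": ""},
  {"name": "BREAST CANCER",
   "DiseaseID": "MIM114480",
   "MedGenCUI": "C0346153",
   "AssociatedGenes": "5888, 3845, 83990, 8493, 580, 841, 3161, 7517, 9821, 79728, 5245, 5002, 672, 675, 5290, 11200, 207, 472, 4835, 999, 7157, 8438"}
]
"""


def find_disease(records, disease_id):
    """Return the record whose DiseaseID matches, or None if absent."""
    for record in records:
        if record.get("DiseaseID") == disease_id:
            return record
    return None


def associated_genes(record):
    """Split the comma-separated AssociatedGenes field into a list of IDs.

    Records without the field (as for MIM113705 above) yield an empty list.
    """
    raw = record.get("AssociatedGenes", "")
    return [gene.strip() for gene in raw.split(",") if gene.strip()]


records = json.loads(response_text)
breast_cancer = find_disease(records, "MIM114480")
seed_genes = associated_genes(breast_cancer)
no_genes = associated_genes(find_disease(records, "MIM113705"))
```

The resulting `seed_genes` list can then be passed to whatever downstream step the pipeline needs, such as filtering against the genes present in a user-supplied network.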


Figure 2. Visualization of highly ranked candidate genes and diseases in topological relationships with breast cancer.

(a) Topological relationships between highly ranked candidate genes and known breast cancer-associated genes. (b) Topological relationships between highly ranked candidate diseases, breast cancer and its known associated genes. For diseases, rhombus-shaped and rectangular nodes are breast cancer and candidate diseases, respectively. For genes, triangular and octagonal nodes are known breast cancer-associated genes and candidate genes, respectively. Nodes with high rankings are shown in red, relatively high rankings in pink, medium rankings in white and light green, and low rankings in green.

Discussion and conclusions

The random walk with restart algorithm on a heterogeneous network of diseases and genes has been shown to be a state-of-the-art method for predicting novel disease-gene and disease-disease associations compared to other network-based algorithms [3,12]. However, its prediction performance depends strongly on the heterogeneous network used, which combines a disease similarity network, a gene/protein interaction network and known disease-gene associations. Indeed, one study showed that prediction performance can be improved by using a Gene Ontology-based gene similarity network instead of the human protein interaction network [22]. In addition, we have recently shown that using a disease similarity network constructed from the Human Phenotype Ontology [23] improved the prediction of disease-associated genes [24] as well as disease-associated non-coding RNAs [25,26]. Therefore, to facilitate the use of such disease/gene similarity networks, we enable users to provide these networks themselves. For the gene/protein network, users can import a network from various molecular interaction data sources or from other analysis pipelines. Similarly, disease similarity networks can be imported from other analysis tools such as DOSim [27] and HPOSim [28]. Moreover, the ranked candidate genes can be used as input to annotation and enrichment toolkits such as DAVID [29] and GSEA [30] to further support their associations with the disease of interest. Taken together, the added automation features make autoHGPEC more useful and accessible to a wider range of users.


Software and data availability

All prerequisite data are already included in the app. Refer to the user manual (Supplementary File 1) for additional annotation data such as Gene Ontology.

how to cite this article
Le DH and Tran TTH. autoHGPEC: Automated prediction of novel disease-gene and disease-disease associations and evidence collection based on a random walk on heterogeneous network [version 1; peer review: 2 approved with reservations] F1000Research 2018, 7:658 (https://doi.org/10.12688/f1000research.14810.1)

Open Peer Review

Reviewer Report 28 Aug 2018
Thanh Le Van, Janssen Pharmaceutica NV, Beerse, Belgium 
Approved with Reservations
Summary:

The paper "autoHGPEC: Automated prediction of novel disease-gene and disease-disease associations and evidence collection based on a random walk on heterogeneous network" presents an enhanced implementation of HGPEC, the previous work of one of the co-authors ...
HOW TO CITE THIS REPORT
Le Van T. Reviewer Report For: autoHGPEC: Automated prediction of novel disease-gene and disease-disease associations and evidence collection based on a random walk on heterogeneous network [version 1; peer review: 2 approved with reservations]. F1000Research 2018, 7:658 (https://doi.org/10.5256/f1000research.16119.r36046)
Reviewer Report 07 Aug 2018
Tin Nguyen, Department of Computer Science and Engineering, University of Nevada, Reno, Reno, NV, USA 
Approved with Reservations
This paper aims to address the challenging task of trying to prioritize candidate genes and diseases based on their degree of relevance within a known heterogenous network of disease-gene and disease-disease associations. A previous iteration of this Cytoscape app called ...
HOW TO CITE THIS REPORT
Nguyen T. Reviewer Report For: autoHGPEC: Automated prediction of novel disease-gene and disease-disease associations and evidence collection based on a random walk on heterogeneous network [version 1; peer review: 2 approved with reservations]. F1000Research 2018, 7:658 (https://doi.org/10.5256/f1000research.16119.r36572)