autoHGPEC: Automated prediction of novel disease-gene and disease-disease associations and evidence collection based on a random walk on heterogeneous network

Duc-Hau Le; Trang T.H. Tran

doi:10.12688/f1000research.14810.1

Software Tool Article

[version 1; peer review: 2 approved with reservations]

Duc-Hau Le¹, Trang T.H. Tran¹

PUBLISHED 24 May 2018

¹ School of Computer Science and Engineering, Thuyloi University, 175 Tay Son, Dong Da, Hanoi, Vietnam

Duc-Hau Le
Roles: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Project Administration, Resources, Software, Supervision, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Trang T.H. Tran
Roles: Software, Writing – Original Draft Preparation

This article is included in the Cytoscape gateway.

Abstract

Identifying novel disease-gene and disease-disease associations is an important task in biomedical research. We recently developed a Cytoscape app, HGPEC, which uses a state-of-the-art network-based method for this task. This paper describes an upgraded version of HGPEC, named autoHGPEC, with added automation features. With these features, autoHGPEC can be used as a component of larger analysis pipelines and can make use of other data resources. We demonstrate autoHGPEC by predicting novel breast cancer-associated genes and diseases. Visualizing and collecting evidence for the associations between the 20 top-ranked genes/diseases and breast cancer further illustrates the capabilities of autoHGPEC.

Keywords

Cytoscape app, Automation features, CyREST, R, Disease-gene association, Disease-disease association, Random walk with restart algorithm, Heterogeneous network, Gene prioritization, Disease prioritization

Corresponding author: Duc-Hau Le

Competing interests: No competing interests were disclosed.

Grant information: The author(s) declared that no grants were involved in supporting this work.

Copyright: © 2018 Le DH and Tran TTH. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Le DH and Tran TTH. autoHGPEC: Automated prediction of novel disease-gene and disease-disease associations and evidence collection based on a random walk on heterogeneous network [version 1; peer review: 2 approved with reservations]. F1000Research 2018, 7:658 (https://doi.org/10.12688/f1000research.14810.1) First published: 24 May 2018, 7:658 (https://doi.org/10.12688/f1000research.14810.1) Latest published: 24 May 2018, 7:658 (https://doi.org/10.12688/f1000research.14810.1)

Introduction

One of the challenging tasks in biomedicine is to prioritize candidate genes and diseases by their degree of relevance to a disease of interest. This is the starting point for identifying novel disease-gene and disease-disease associations. A large number of computational methods, including network-based and machine learning-based ones, have been proposed for this task¹,². State-of-the-art network-based methods often integrate diseases and genes into a heterogeneous network and then apply a propagation algorithm that exploits the similarity between diseases/genes and known disease-gene associations to predict novel associations³⁻⁷. Some tools have also been developed to facilitate the use of these methods. However, most of them focus only on predicting novel disease-gene associations⁸⁻¹⁰, including some tools developed as apps of Cytoscape¹¹. Recently, we have developed a Cytoscape app, HGPEC¹², to predict both disease-gene and disease-disease associations based on a state-of-the-art method on a heterogeneous network of diseases and genes³. HGPEC was shown to outperform two other network-based Cytoscape apps for predicting novel disease-gene associations, GPEC¹³ and PRINCIPLE¹⁴, in terms of prediction performance¹². In addition, HGPEC can prioritize candidate genes of diseases without known molecular basis and collect evidence to support novel predictions from various data resources such as Gene Ontology¹⁵, Disease Ontology¹⁶, KEGG pathway¹⁷, GeneRIF¹⁸, PubMed¹⁹, protein complexes²⁰ and OMIM²¹. As a Cytoscape app, HGPEC can exploit advanced features of Cytoscape such as data visualization and integration. However, Cytoscape is a desktop-based tool, so HGPEC cannot be linked flexibly to other analysis environments such as R and Python. This limits the use of HGPEC, because it cannot be run automatically as a component of a complex analysis pipeline in those environments. It also prevents Cytoscape from integrating data from other data resources. Recently, automation features have been added to Cytoscape to facilitate these tasks.

In this study, we upgrade HGPEC by adding automation features and name the new app autoHGPEC. autoHGPEC has the same functions as HGPEC. However, these functions can be called through both CyREST functions and commands, and thus from external environments. To use autoHGPEC, a heterogeneous network of diseases and genes, composed of a disease similarity network, a gene/protein network and known disease-gene associations, must be provided. Then, a disease of interest must be selected from the disease similarity network. The disease and its known associated genes (if any) are used as training/seed data. A set of candidate genes is then defined by selecting from the gene network or from a chromosome. These candidate genes and all remaining diseases are then ranked by an RWRH-based method (see the Methods section). Finally, users can select top-ranked genes/diseases for further analyses such as visualization and evidence collection. We show the ability of autoHGPEC to predict novel genes and diseases associated with breast cancer.

Methods

RWRH-based method

autoHGPEC was implemented using a ranking algorithm, random walk with restart on a heterogeneous network (RWRH)¹². Briefly, this network-based algorithm propagates the disease information embedded in a disease of interest and its known associated genes (also known as seed/training nodes) to other diseases and genes in the heterogeneous network. The propagation is performed by a random walk starting from the seed nodes. At each node, the random walker either moves to an adjacent node or jumps back to the seed nodes with a prior probability. This process is repeated iteratively until a steady state is reached. The score assigned to each node at this state represents its degree of relevance to the seed nodes, and thus to the disease of interest. Finally, candidate genes and diseases are ranked by their scores, and top-ranked candidates can be selected as promising genes and diseases for further investigation.
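
The propagation described above can be illustrated with a small Python sketch. This is not the autoHGPEC implementation: it is a simplified, unweighted toy version of a random walk with restart over a gene network, a disease network and gene-disease links, written under the assumption that the three parameters behave as in the general RWRH formulation (a restart or "back" probability, a probability of jumping between the two networks, and a weight that splits the restart mass between seed diseases and seed genes). The function name `rwrh`, its arguments and the toy node names are all hypothetical.

```python
def rwrh(gene_adj, disease_adj, assoc, seed_genes, seed_diseases,
         back_prob=0.5, jump_prob=0.6, subnet_weight=0.7,
         tol=1e-10, max_iter=1000):
    """Toy random walk with restart on a gene-disease network (illustrative only)."""
    # Gene-disease links, stored in both directions.
    cross = {n: set() for n in list(gene_adj) + list(disease_adj)}
    for gene, disease in assoc:
        cross[gene].add(disease)
        cross[disease].add(gene)

    # A node with cross links sends jump_prob to the other network and
    # (1 - jump_prob) to neighbours in its own network; a node without
    # cross links keeps all of its mass inside its own network.
    trans = {}
    for adj in (gene_adj, disease_adj):
        for node, nbrs in adj.items():
            lam = jump_prob if cross[node] else 0.0
            out = {m: (1.0 - lam) / len(nbrs) for m in nbrs}
            for m in cross[node]:
                out[m] = lam / len(cross[node])
            trans[node] = out

    # Restart vector (assumption): seed diseases share subnet_weight,
    # seed genes share the remaining 1 - subnet_weight.
    p0 = dict.fromkeys(trans, 0.0)
    for g in seed_genes:
        p0[g] = (1.0 - subnet_weight) / len(seed_genes)
    for d in seed_diseases:
        p0[d] = subnet_weight / len(seed_diseases)

    # Iterate until the scores stop changing (steady state).
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: back_prob * p0[n] for n in p0}
        for n, out in trans.items():
            for m, pr in out.items():
                nxt[m] += (1.0 - back_prob) * p[n] * pr
        delta = sum(abs(nxt[n] - p[n]) for n in p)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical toy network: four genes, three diseases, two known associations.
gene_adj = {"G1": {"G2", "G3"}, "G2": {"G1", "G3"},
            "G3": {"G1", "G2", "G4"}, "G4": {"G3"}}
disease_adj = {"D1": {"D2"}, "D2": {"D1", "D3"}, "D3": {"D2"}}
assoc = [("G1", "D1"), ("G4", "D3")]

scores = rwrh(gene_adj, disease_adj, assoc, seed_genes=["G1"], seed_diseases=["D1"])
seeds = {"G1", "D1"}
ranked_genes = sorted((g for g in gene_adj if g not in seeds), key=scores.get, reverse=True)
ranked_diseases = sorted((d for d in disease_adj if d not in seeds), key=scores.get, reverse=True)
print(ranked_genes)     # ['G3', 'G2', 'G4']
print(ranked_diseases)  # ['D2', 'D3']
```

In this toy example the candidates closest to the seeds (G3 and G2 next to G1, D2 next to D1) receive the highest scores. Real disease similarity networks are usually weighted, which this sketch ignores.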

Implementation

autoHGPEC is an upgraded version of HGPEC¹² with added automation features. The main functions of HGPEC, such as prioritization, visualization and evidence collection, were therefore kept. In addition, as in HGPEC, a number of databases are preinstalled in autoHGPEC to facilitate its use. These include disease similarity networks, gene/protein networks and known disease-gene associations, as well as annotation data such as Gene Ontology¹⁵, Disease Ontology¹⁶, KEGG pathways¹⁷, GeneRIF¹⁸ and protein complexes²⁰. Users can also supply other networks themselves. To provide automation features, we first refactored the source code of HGPEC to use Cytoscape Tunable annotations, replacing the control panels in the west panel of HGPEC with a menu system. All functions of HGPEC are therefore accessed through the menu system. In addition, the workflow of HGPEC is exposed to users through the CyREST Command API (which can be explored in the Swagger UI under the menu autoHGPEC). Corresponding CyREST API functions were developed as well. Thus, the result of each step in the workflow can be passed back to the caller in JSON format for further analysis in R or Python.

Operation

autoHGPEC is designed to predict novel disease-gene and disease-disease associations and to collect supporting evidence, based on a random walk on a heterogeneous network, with added automation features. It therefore follows the same workflow as HGPEC¹². However, in addition to being used in desktop-based Cytoscape through the menu system, its functions can be called using the CyREST Command API and from other analysis tools such as R. Figure 1 shows the workflow of autoHGPEC in three running environments (see the user manual in Supplementary File 1). As a Cytoscape app with automation features, autoHGPEC can be run on any computer that satisfies the minimal requirements to run Cytoscape.

Figure 1. Workflow of autoHGPEC in three environments (i.e., Menu-based Cytoscape, CyREST Command API and R).

Use cases

To demonstrate the functions of autoHGPEC with automation features, we show its ability to predict novel genes and diseases associated with breast cancer (OMIM ID: 114480). Here, we briefly describe this case study by following the 5-step workflow in Figure 1 (see the user manual in Supplementary File 1 for more detail):

- First, a heterogeneous network of genes and diseases was constructed by connecting a preinstalled disease similarity network (i.e., Disease_Similarity_Network_5) including 5,080 diseases and 19,729 interactions, a preinstalled human protein interaction network (i.e., Default_Human_PPI_Network) including 10,486 genes and 50,791 interactions, and known disease-gene associations collected from OMIM²¹. This step can be accomplished with the following command from within R:
- > commandRun('autoHGPEC step1_construct_network DiseaseGene="Disease-gene from OMIM" diseaseNetwork="Disease_Similarity_Network_5" geneNetwork="Default_Human_PPI_Network"')
- Second, breast cancer (OMIM ID: 114480) was selected for investigation. This disease is known to be associated with 21 genes that are also present in the human protein interaction network. The training set was then built from these genes and the disease of interest. The following two commands perform this task from within R:
- > commandRun('autoHGPEC step2_1_select_disease diseaseName="breast cancer"')
- > commandRun('autoHGPEC step2_2_create_training_list diseaseTraining="MIM114480"')
- Third, we selected all 10,465 remaining genes in the protein interaction network as candidate genes. This can be done with the following command:
- > commandRun('autoHGPEC step3_PCG_allRemaining')
- Fourth, all genes and diseases in the heterogeneous network were ranked by applying the RWRH-based method, with the back-probability, jumping probability and subnetwork importance weight set to 0.5, 0.6 and 0.7, respectively. The following command accomplishes this task:
- > commandRun('autoHGPEC step4_prioritize backProb=0.5 jumpProb=0.6 subnetWeight=0.7')
- Finally, we visualized and collected evidence for the associations between the 20 highest-ranked candidate genes/diseases and breast cancer. Users must first highlight the diseases and genes of interest in the corresponding network. These two tasks can be performed with the following commands, respectively:
- > commandRun('autoHGPEC step5_2_visualize')
- > commandRun('autoHGPEC step5_1_search_evidences')
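
The five steps above are issued from R. As a hedged sketch of how the same command strings might be sent from Python, the snippet below splits each string into a namespace, a command name and its arguments, and maps them onto a CyREST Commands endpoint. The base URL, the endpoint layout (`/v1/commands/<namespace>/<command>` with a JSON body) and the helper names `command_to_request` and `run_command` are assumptions for illustration, not part of autoHGPEC. `run_command` is defined but not called here, because it needs a running Cytoscape session with the app installed.

```python
import json
import shlex
from urllib import request

# Assumed CyREST Commands base URL (port 1234 as in the request URL quoted in this article).
CYREST_BASE = "http://localhost:1234/v1/commands"


def command_to_request(command):
    """Split 'namespace command key="value" ...' into (url, argument dict)."""
    namespace, name, *pairs = shlex.split(command)  # shlex keeps quoted values together
    args = dict(pair.split("=", 1) for pair in pairs)
    return f"{CYREST_BASE}/{namespace}/{name}", args


def run_command(command):
    """POST one command to a running Cytoscape instance and return the parsed JSON reply."""
    url, args = command_to_request(command)
    req = request.Request(url, data=json.dumps(args).encode("utf-8"),
                          headers={"Content-Type": "application/json"},
                          method="POST")
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Command strings copied from the workflow steps above.
workflow = [
    'autoHGPEC step1_construct_network DiseaseGene="Disease-gene from OMIM" '
    'diseaseNetwork="Disease_Similarity_Network_5" geneNetwork="Default_Human_PPI_Network"',
    'autoHGPEC step2_1_select_disease diseaseName="breast cancer"',
    'autoHGPEC step2_2_create_training_list diseaseTraining="MIM114480"',
    'autoHGPEC step3_PCG_allRemaining',
    'autoHGPEC step4_prioritize backProb=0.5 jumpProb=0.6 subnetWeight=0.7',
]

for cmd in workflow:
    url, args = command_to_request(cmd)
    print(url, args)
```

Argument values are kept as strings in this sketch; whether numeric parameters need to be converted before sending is left open.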

Visualization results (Figure 2a and b) show that most of the top-ranked candidate genes are directly connected to known breast cancer-associated genes. In addition, highly ranked candidate diseases are directly connected to either known/training genes or the disease of interest. For evidence collection, we annotated and searched for evidence of promising associations between the top-ranked candidate genes/diseases and breast cancer. The results showed that each of the promising associations is supported by at least two data sources. More detail on the interpretation of the visualization and evidence collection results for these associations can be found in the HGPEC study¹². Besides the fact that almost all commands of autoHGPEC return results in JSON format, the results of autoHGPEC are exposed via the CyREST API as well (menu Help/Automation/CyREST Api). For example, the R command commandRun('autoHGPEC step2_1_select_disease diseaseName="breast cancer"') in Step 2 can be performed directly through the CyREST API with the request URL http://localhost:1234/autohgpec/v1/selectDisease/breast%20cancer (this URL is available after the heterogeneous network has been constructed successfully in Step 1). It returns a list of OMIM entries associated with “breast cancer” in JSON format as follows:

[
  {
    "name": "BREAST CANCER 1 GENE; BRCA1",
    "DiseaseID": "MIM113705",
    "MedGenCUI": ""
  },
  {
    "name": "BREAST CANCER",
    "DiseaseID": "MIM114480",
    "MedGenCUI": "C0346153",
    "AssociatedGenes": "5888, 3845, 83990, 8493, 580, 841, 3161, 7517, 9821, 79728, 5245, 5002, 672, 675, 5290, 11200, 207, 472, 4835, 999, 7157, 8438"
  },
  {
    "name": "BREAST-OVARIAN CANCER, FAMILIAL, SUSCEPTIBILITY TO, 1; BROVCA1",
    "DiseaseID": "MIM604370",
    "MedGenCUI": "C2676676",
    "AssociatedGenes": "4978, 2956, 672, 5290, 207, 5071"
  },
  {
    "name": "BREAST CANCER ANTIESTROGEN RESISTANCE 3; BCAR3",
    "DiseaseID": "MIM604704",
    "MedGenCUI": ""
  }
]

Therefore, users can easily call this CyREST API and use this result in their workflow as they need.
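
As one possible illustration of using this result downstream, the Python sketch below builds the request URL quoted above and turns a response of this shape into a mapping from disease ID to gene IDs. It parses an embedded sample (two entries copied from the response shown above) rather than querying Cytoscape, and the helper names `select_disease_url` and `genes_by_disease` are hypothetical.

```python
import json
from urllib.parse import quote

# Endpoint prefix taken from the request URL quoted in the text.
BASE = "http://localhost:1234/autohgpec/v1/selectDisease"


def select_disease_url(name):
    """Build the selectDisease URL, percent-encoding spaces in the disease name."""
    return f"{BASE}/{quote(name)}"


def genes_by_disease(records):
    """Map each DiseaseID to its list of gene IDs ([] when none are listed)."""
    result = {}
    for rec in records:
        raw = rec.get("AssociatedGenes", "")
        result[rec["DiseaseID"]] = [g.strip() for g in raw.split(",") if g.strip()]
    return result


# Two entries copied from the sample response above.
sample = """
[
  {"name": "BREAST CANCER 1 GENE; BRCA1", "DiseaseID": "MIM113705", "MedGenCUI": ""},
  {"name": "BREAST CANCER", "DiseaseID": "MIM114480", "MedGenCUI": "C0346153",
   "AssociatedGenes": "5888, 3845, 83990, 8493, 580, 841, 3161, 7517, 9821, 79728, 5245, 5002, 672, 675, 5290, 11200, 207, 472, 4835, 999, 7157, 8438"}
]
"""

genes = genes_by_disease(json.loads(sample))
print(select_disease_url("breast cancer"))
print(len(genes["MIM114480"]), genes["MIM113705"])
```

Entries without an "AssociatedGenes" field, such as MIM113705 here, map to an empty list, so later pipeline steps can treat every entry uniformly.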

Figure 2. Visualization of highly ranked candidate genes and diseases in topological relationships with breast cancer.

(a) Topological relationships between highly ranked candidate genes and known breast cancer-associated genes. (b) Topological relationships between highly ranked candidate diseases, breast cancer and its known associated genes. Note: for diseases, rhombus and rectangle nodes are breast cancer and candidate diseases, respectively. For genes, triangle and octagon nodes are known breast cancer-associated genes and candidate genes, respectively. Nodes with high rankings are in red, relatively high in pink, medium in white and light green, and low in green.

Discussion and conclusions

Random walk with restart on a heterogeneous network of diseases and genes has been shown to be a state-of-the-art method for predicting novel disease-gene and disease-disease associations compared to other network-based algorithms³,¹². However, its prediction performance depends strongly on the heterogeneous network used, which combines a disease similarity network, a gene/protein interaction network and known disease-gene associations. Indeed, one study showed that prediction performance can be improved by using a Gene Ontology-based gene similarity network instead of the human protein interaction network²². In addition, we have recently shown that using a disease similarity network constructed from the Human Phenotype Ontology²³ improved the prediction of disease-associated genes²⁴ as well as disease-associated non-coding RNAs²⁵,²⁶. Therefore, to facilitate the use of such disease/gene similarity networks, we enable users to provide these networks themselves. For the gene/protein network, users can import a network from various molecular interaction data sources or from other analysis pipelines. Similarly, disease similarity networks can be imported from other analysis tools such as DOSim²⁷ and HPOSim²⁸. Moreover, the ranked candidate genes can be used as input to other annotation and enrichment toolkits, such as DAVID²⁹ and GSEA³⁰, to further support their associations with the disease of interest. Taken together, with added automation features, autoHGPEC can be more useful and can reach a wider range of users.

Software and data availability

1. autoHGPEC on Cytoscape Apps: http://apps.cytoscape.org/apps/autohgpec
2. User manual can be downloaded at https://sites.google.com/site/duchaule2011/bioinformatics-tools/autohgpec
3. autoHGPEC can be run from within R using RCy3 (https://www.bioconductor.org/packages/release/bioc/html/RCy3.html)
4. Source code: https://github.com/trangtran86/autoHGPEC
5. Archived source code as at time of publication: http://doi.org/10.5281/zenodo.1228521³¹
6. License: MIT

All prerequisite data are already included in the apps. Refer to the user manual (Supplementary File 1) for other additional annotation data such as Gene Ontology.

Supplementary material

Supplementary File 1: autoHGPEC user manual.

References

1. Barabási AL, Gulbahce N, Loscalzo J: Network medicine: a network-based approach to human disease. Nat Rev Genet. 2011; 12(1): 56–68.
2. Wang X, Gulbahce N, Yu H: Network-based methods for human disease gene prediction. Brief Funct Genomics. 2011; 10(5): 280–293.
3. Li Y, Patra JC: Genome-wide inferring gene-phenotype relationship by walking on the heterogeneous network. Bioinformatics. 2010; 26(9): 1219–1224.
4. Chen Y, Jiang T, Jiang R: Uncover disease genes by maximizing information flow in the phenome-interactome network. Bioinformatics. 2011; 27(13): i167–i176.
5. Guo X, Gao L, Wei C, et al.: A computational method based on the integration of heterogeneous networks for predicting disease-gene associations. PLoS One. 2011; 6(9): e24171.
6. Le DH, Nguyen MH: Towards more realistic machine learning techniques for prediction of disease-associated genes. In: Proceedings of the Sixth International Symposium on Information and Communication Technology; Hue City, Viet Nam. 2833269: ACM. 2015; 116–120.
7. Le DH, Xuan Hoai N, Kwon YK: A Comparative Study of Classification-Based Machine Learning Methods for Novel Disease Gene Prediction. In: Knowledge and Systems Engineering. Edited by Nguyen V-H, Le A-C, Huynh V-N, Springer International Publishing; 2015; 326: 577–588.
8. Oti M, Ballouz S, Wouters MA: Web tools for the prioritization of candidate disease genes. In: In Silico Tools for Gene Discovery. Methods Mol Biol. 2011; 760: 189–206.
9. Tranchevent LC, Capdevila FB, Nitsch D, et al.: A guide to web tools to prioritize candidate genes. Brief Bioinform. 2011; 12(1): 22–32.
10. Moreau Y, Tranchevent LC: Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nat Rev Genet. 2012; 13(8): 523–536.
11. Shannon P, Markiel A, Ozier O, et al.: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003; 13(11): 2498–2504.
12. Le DH, Pham VH: HGPEC: a Cytoscape app for prediction of novel disease-gene and disease-disease associations and evidence collection based on a random walk on heterogeneous network. BMC Syst Biol. 2017; 11(1): 61.
13. Le DH, Kwon YK: GPEC: A Cytoscape plug-in for random walk-based gene prioritization and biomedical evidence collection. Comput Biol Chem. 2012; 37: 17–23.
14. Gottlieb A, Magger O, Berman I, et al.: PRINCIPLE: a tool for associating genes with diseases via network propagation. Bioinformatics. 2011; 27(23): 3325–3326.
15. Ashburner M, Ball CA, Blake JA, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000; 25(1): 25–29.
16. Schriml LM, Arze C, Nadendla S, et al.: Disease Ontology: a backbone for disease semantic integration. Nucleic Acids Res. 2012; 40(Database issue): D940–D946.
17. Kanehisa M, Goto S, Furumichi M, et al.: KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 2010; 38(Database issue): D355–D360.
18. Mitchell JA, Aronson AR, Mork JG, et al.: Gene indexing: characterization and analysis of NLM's GeneRIFs. AMIA Annu Symp Proc. 2003; 2003: 460–464.
19. Sayers EW, Barrett T, Benson DA, et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2011; 39(Database issue): D38–D51.
20. Ruepp A, Brauner B, Dunger-Kaltenbach I, et al.: CORUM: the comprehensive resource of mammalian protein complexes. Nucleic Acids Res. 2008; 36(Database issue): D646–D650.
21. Amberger J, Bocchini CA, Scott AF, et al.: McKusick's Online Mendelian Inheritance in Man (OMIM). Nucleic Acids Res. 2009; 37(Database issue): D793–D796.
22. Jiang R, Gan M, He P: Constructing a gene semantic similarity network for the inference of disease genes. BMC Syst Biol. 2011; 5 Suppl 2: S2.
23. Köhler S, Doelken SC, Mungall CJ, et al.: The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res. 2014; 42(Database issue): D966–D974.
24. Le DH, Dang VT: Ontology-based disease similarity network for disease gene prediction. Vietnam Journal of Computer Science. 2016; 3(3): 197–205.
25. Le DH: Disease phenotype similarity improves the prediction of novel disease-associated microRNAs. In: Information and Computer Science (NICS), 2015 2nd National Foundation for Science and Technology Development Conference on: 16–18 Sept. 2015. 2015; 76–81.
26. Le DH, Dao LTM: Annotating diseases using human phenotype ontology improves prediction of disease-associated long non-coding RNAs. J Mol Biol. 2018; pii: S0022-2836(18)30401-7.
27. Li J, Gong B, Chen X, et al.: DOSim: An R package for similarity between diseases based on Disease Ontology. BMC Bioinformatics. 2011; 12(1): 266.
28. Deng Y, Gao L, Wang B, et al.: HPOSim: an R package for phenotypic similarity measure and enrichment analysis based on the human phenotype ontology. PLoS One. 2015; 10(2): e0115692.
29. Dennis G Jr, Sherman BT, Hosack DA, et al.: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003; 4(9): R60.
30. Subramanian A, Tamayo P, Mootha VK, et al.: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005; 102(43): 15545–15550.
31. trangtran86: trangtran86/autoHGPEC: First commit (Version 1.0). Zenodo. 2018.

Open Peer Review

Reviewer Report 28 Aug 2018

Thanh Le Van, Janssen Pharmaceutica NV, Beerse, Belgium

Approved with Reservations

https://doi.org/10.5256/f1000research.16119.r36046

Summary:

The paper "autoHGPEC: Automated prediction of novel disease-gene and disease-disease associations and evidence collection based on a random walk on heterogeneous network" presents an enhanced implementation of HGPEC, the previous work of one of the co-authors of the paper. The new implementation allows users to integrate the data analysis steps in Cytoscape with other data analysis pipelines in R or CyREST API. Indeed, this new feature would be useful as users now can take the advantages of the network-based data analysis and visualization in Cytoscape as well as the power of statistical data analysis of R, for example.

Below are my detail comments:

Is the rationale for developing the new software tool clearly explained?

In my opinion, the "automatic features" are not well explained in the paper. The first place where the authors introduce the concept of "automatic features" is the last sentence of the first paragraph in the Introduction section. However, there is no further explanation of this concept. Hence, it is very easy for people in the machine learning community to confuse it with the concept of automatic feature selection in the automated machine learning field.

To clear up the possible confusion, we can do two things: 1) add a citation of the paper/website where Cytoscape originally introduced this concept; 2) briefly explain how Cytoscape provides this type of feature and how HGPEC can leverage the facilities provided by Cytoscape.
Is the description of the software tool technically sound?

Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

The user manual is quite detailed. However, there is room for improvement in the presentation; for example, the spacing between pictures and paragraphs and the indentation of paragraphs are not always consistent or pleasant to read. I highly recommend using LaTeX to produce the manual.

There is a Vietnamese sentence on page 10 of the manual, which should be removed.
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Partly

- Please add a citation when mentioning that breast cancer is known to be associated with 12 genes (first paragraph, page 5)
- Please briefly explain why the results of the demonstration make sense
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Overall, the authors should further explain what automatic features are and why they are worthy of investigation.

Is the rationale for developing the new software tool clearly explained?

Partly
Is the description of the software tool technically sound?

Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Competing Interests: No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Reviewer Report 07 Aug 2018

Tin Nguyen, Department of Computer Science and Engineering, University of Nevada, Reno, Reno, NV, USA

Approved with Reservations

https://doi.org/10.5256/f1000research.16119.r36572

This paper aims to address the challenging task of prioritizing candidate genes and diseases based on their degree of relevance within a known heterogeneous network of disease-gene and disease-disease associations. A previous iteration of this Cytoscape app, called HGPEC, employs a network learning-based method only for disease-gene associations; however, it has now been upgraded to autoHGPEC with added automation features, the flexibility to be integrated into more complex analysis pipelines and the ability to take other data resources as input. Another addition is that autoHGPEC predicts not only disease-gene associations but disease-disease associations as well. This paper details the application of autoHGPEC in predicting novel breast cancer-associated genes and diseases with increased automation and flexibility.

I commend the authors on the presentation of their work. The paper was logical, concise and easy to read. Below are my comments; I hope the authors find them useful.

Major comments:

It is mentioned that the app is highly dependent on the heterogeneous network used, gene/protein interaction networks and known disease-gene associations. Firstly, clarify how these networks are provided and what exactly they are. Secondly, perhaps using gene ontology-based gene similarity networks could be better. This is crucial, as there could be post-translational modification effects to be explored. However, with this said, the paper claims HGPEC can prioritize candidate genes without knowing its molecular basis. Reconsider the reasoning behind this claim.

How are the input and output data reliable if we don’t know the molecular basis? Perhaps this is a dangerous statement, as there are now many computational tools providing evidence of the pertinence of post-translational modification effects such as methylation, acetylation, etc. Be careful how this is stated. It has current ramifications and will have future ones. At least provide an explanation as to why knowledge of this isn’t needed.
If autoHGPEC performs better than GPEC and PRINCIPLE, present data on it. Where is the comparison between the three, and what are the criteria for ranking them?
Has a comparison of results been done for when a set of candidate genes is selected from a gene network or chromosome? This is a flexibility-based component of the app; however, testing these cases could give evidence for which is more reliable and in what circumstances.

Minor comments

“Recently, we have developed a Cytoscape app, HGPEC, to predict both disease-gene and disease-disease associations based on a state-of-the-art method on a heterogeneous network of diseases and genes.” Be specific. I understand what the paper is trying to say here, with “state of the art” meaning either network-based or machine learning-based approaches; however, it took a few readings to get a clear understanding. Mention how pertinent these approaches are, and then state which state-of-the-art approach is being used. This will get the point across quickly without introducing any confusion.
Technical and grammatical writing errors:

- “…we first refactor source code of HGPEC to implement Cytoscape Tunable annotations to replace control panels of HGPEC in the west” –
- “Beside the fact is that almost commands of autoHGPEC return results in JSON format…”
- “autoHGPEC is an upgradED version of HGPEC…”
- “For gene/protein networkS, user can import the network from various molecular interaction”

There are more throughout the paper and especially the supplementary document. Read through and correct carefully. Pay attention to your font formatting and spacing to keep things consistent.
Change the node colors on page 8 of the supplementary document. It is a little confusing to see red as top ranked, green as bottom ranked and white and light-green as middle ranked. Maybe provide a legend.
Bottom of page 10 – change column names for abbreviation for association. Pick something other than ass. Maybe assoc.

Overall, good job. Two crucial benefits of the new app are that it can take data in from multiple sources as opposed to only one source, possibly lowering the chance of error and bias, and that it can be integrated with R and Python, allowing it to serve as a component of more complex analysis pipelines. However, it seems this app can only be used if known disease-gene/protein associations and disease similarity networks are given. Why doesn’t the paper mention protein interactions any further? Specifically, what pertinent information is being taken away from these said protein interactions?

Is the rationale for developing the new software tool clearly explained?

Yes
Is the description of the software tool technically sound?

Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Competing Interests: No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Reviewer Report 28 Aug 2018

Thanh Le Van, Janssen Pharmaceutica NV, Beerse, Belgium

Approved with Reservations

Summary:

The paper "autoHGPEC: Automated prediction of novel disease-gene and disease-disease associations and evidence collection based on a random walk on heterogeneous network" presents an enhanced implementation of HGPEC, previous work by one of the co-authors of the paper. The new implementation allows users to integrate the data analysis steps in Cytoscape with other data analysis pipelines in R or through the CyREST API. Indeed, this new feature would be useful, as users can now take advantage of the network-based data analysis and visualization in Cytoscape as well as, for example, the power of statistical data analysis in R.

Below are my detailed comments:

Is the rationale for developing the new software tool clearly explained?

In my opinion, the concept of "automatic features" is not well explained in the paper. The first place where the authors introduce it is the last sentence of the first paragraph in the Introduction section. However, there is no further explanation of this concept. Hence, it is very easy for people in the machine learning community to confuse it with automatic feature selection in the automated machine learning field.

To clear up the possible confusion, the authors could do two things: 1) add a citation of the paper/website where Cytoscape originally introduced this concept; 2) briefly explain how Cytoscape provides this type of feature and how HGPEC can leverage the facilities provided by Cytoscape.
Is the description of the software tool technically sound?

Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

The user manual is quite detailed. However, there is room for improvement in the presentation; for example, the spacing between pictures and paragraphs and the indentation of paragraphs are not always consistent or pleasant to read. I highly recommend using LaTeX to produce the manual.

There is a Vietnamese sentence on page 10 of the manual, which should be removed.
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Partly

- Please add a citation when mentioning that breast cancer is known to be associated with 12 genes (first paragraph, page 5)
- Please briefly explain why the results of the demonstration make sense
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Overall, the authors should further explain what the "automatic features" are and why they are worthy of investigation.

Is the rationale for developing the new software tool clearly explained?

Partly
Is the description of the software tool technically sound?

Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Competing Interests

No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

[1] Barabási AL, Gulbahce N, Loscalzo J: Network medicine: a network-based approach to human disease. Nat Rev Genet. 2011; 12(1): 56–68.

[2] Wang X, Gulbahce N, Yu H: Network-based methods for human disease gene prediction. Brief Funct Genomics. 2011; 10(5): 280–293.

[3] Li Y, Patra JC: Genome-wide inferring gene-phenotype relationship by walking on the heterogeneous network. Bioinformatics. 2010; 26(9): 1219–1224.

[4] Chen Y, Jiang T, Jiang R: Uncover disease genes by maximizing information flow in the phenome-interactome network. Bioinformatics. 2011; 27(13): i167–i176.

[5] Guo X, Gao L, Wei C, et al.: A computational method based on the integration of heterogeneous networks for predicting disease-gene associations. PLoS One. 2011; 6(9): e24171.

[6] Le DH, Nguyen MH: Towards more realistic machine learning techniques for prediction of disease-associated genes. In: Proceedings of the Sixth International Symposium on Information and Communication Technology; Hue City, Viet Nam. 2833269: ACM. 2015; 116–120.

[7] Le DH, Xuan Hoai N, Kwon YK: A Comparative Study of Classification-Based Machine Learning Methods for Novel Disease Gene Prediction. In: Knowledge and Systems Engineering. Edited by Nguyen V-H, Le A-C, Huynh V-N, Springer International Publishing; 2015; 326: 577–588.

[8] Oti M, Ballouz S, Wouters MA: Web tools for the prioritization of candidate disease genes. In: In Silico Tools for Gene Discovery. Methods Mol Biol. 2011; 760: 189–206.

[9] Tranchevent LC, Capdevila FB, Nitsch D, et al.: A guide to web tools to prioritize candidate genes. Brief Bioinform. 2011; 12(1): 22–32.

[10] Moreau Y, Tranchevent LC: Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nat Rev Genet. 2012; 13(8): 523–536.

[11] Shannon P, Markiel A, Ozier O, et al.: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003; 13(11): 2498–2504.

[12] Le DH, Pham VH: HGPEC: a Cytoscape app for prediction of novel disease-gene and disease-disease associations and evidence collection based on a random walk on heterogeneous network. BMC Syst Biol. 2017; 11(1): 61.

[13] Le DH, Kwon YK: GPEC: A Cytoscape plug-in for random walk-based gene prioritization and biomedical evidence collection. Comput Biol Chem. 2012; 37: 17–23.

[14] Gottlieb A, Magger O, Berman I, et al.: PRINCIPLE: a tool for associating genes with diseases via network propagation. Bioinformatics. 2011; 27(23): 3325–3326.

[15] Ashburner M, Ball CA, Blake JA, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000; 25(1): 25–29.

[16] Schriml LM, Arze C, Nadendla S, et al.: Disease Ontology: a backbone for disease semantic integration. Nucleic Acids Res. 2012; 40(Database issue): D940–D946.

[17] Kanehisa M, Goto S, Furumichi M, et al.: KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 2010; 38(Database issue): D355–D360.

[18] Mitchell JA, Aronson AR, Mork JG, et al.: Gene indexing: characterization and analysis of NLM's GeneRIFs. AMIA Annu Symp Proc. 2003; 2003: 460–4.

[19] Sayers EW, Barrett T, Benson DA, et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2011; 39(Database issue): D38–D51.

[20] Ruepp A, Brauner B, Dunger-Kaltenbach I, et al.: CORUM: the comprehensive resource of mammalian protein complexes. Nucleic Acids Res. 2008; 36(Database issue): D646–D650.

[21] Amberger J, Bocchini CA, Scott AF, et al.: McKusick's Online Mendelian Inheritance in Man (OMIM). Nucleic Acids Res. 2009; 37(Database issue): D793–D796.

[22] Jiang R, Gan M, He P: Constructing a gene semantic similarity network for the inference of disease genes. BMC Syst Biol. 2011; 5 Suppl 2: S2.

[23] Köhler S, Doelken SC, Mungall CJ, et al.: The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res. 2014; 42(Database issue): D966–D974.

[24] Le DH, Dang VT: Ontology-based disease similarity network for disease gene prediction. Vietnam Journal of Computer Science. 2016; 3(3): 197–205.

[25] Le DH: Disease phenotype similarity improves the prediction of novel disease-associated microRNAs. In: Information and Computer Science (NICS), 2015 2nd National Foundation for Science and Technology Development Conference on: 16–18 Sept. 2015. 2015; 76–81.

[26] Le DH, Dao LTM: Annotating diseases using human phenotype ontology improves prediction of disease-associated long non-coding RNAs. J Mol Biol. 2018; pii: S0022-2836(18)30401-7.

[27] Li J, Gong B, Chen X, et al.: DOSim: An R package for similarity between diseases based on Disease Ontology. BMC Bioinformatics. 2011; 12(1): 266.

[28] Deng Y, Gao L, Wang B, et al.: HPOSim: an R package for phenotypic similarity measure and enrichment analysis based on the human phenotype ontology. PLoS One. 2015; 10(2): e0115692.

[29] Dennis G Jr, Sherman BT, Hosack DA, et al.: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003; 4(9): R60.

[30] Subramanian A, Tamayo P, Mootha VK, et al.: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005; 102(43): 15545–15550.

[31] trangtran86: trangtran86/autoHGPEC: First commit (Version 1.0). Zenodo. 2018.

Figure 1. Workflow of autoHGPEC in three environments (i.e., Menu-based Cytoscape, CyREST Command API and R).

Figure 2. Visualization of highly ranked candidate genes and diseases in topological relationships with breast cancer.
