In silico prediction of physical protein interactions and characterization of interactome orphans

This article has been updated

Abstract

Protein-protein interactions (PPIs) are useful for understanding signaling cascades, predicting protein function, associating proteins with disease and fathoming drug mechanism of action. Currently, only 10% of human PPIs may be known, and about one-third of human proteins have no known interactions. We introduce FpClass, a data mining–based method for proteome-wide PPI prediction. At an estimated false discovery rate of 60%, we predicted 250,498 PPIs among 10,531 human proteins; 10,647 PPIs involved 1,089 proteins without known interactions. We experimentally tested 233 high- and medium-confidence predictions and validated 137 interactions, including seven novel putative interactors of the tumor suppressor p53. Compared to previous PPI prediction methods, FpClass achieved better agreement with experimentally detected PPIs. We provide an online database of annotated PPI predictions (http://ophid.utoronto.ca/fpclass/) and the prediction software (http://www.cs.utoronto.ca/~juris/data/fpclass/).

Figure 1: FpClass workflow.
Figure 2: Evaluation of FpClass using experimental PPI data sets.
Figure 3: Experimental validation of FpClass predictions.
Figure 4: Properties of d0 (orphan) proteins and genes.

Change history

  • 10 December 2014

    In the version of this article initially published online, an author (G.B.M.) was incorrectly listed twice. The error has been corrected for the print, PDF and HTML versions of this article.

References

  1. Cusick, M.E. et al. Literature-curated protein interaction datasets. Nat. Methods 6, 39–46 (2009).

  2. Pastrello, C. et al. Integration, visualization and analysis of human interactome. Biochem. Biophys. Res. Commun. 445, 757–773 (2014).

  3. Bork, P. et al. Protein interaction networks from yeast to human. Curr. Opin. Struct. Biol. 14, 292–299 (2004).

  4. Stumpf, M.P. et al. Estimating the size of the human interactome. Proc. Natl. Acad. Sci. USA 105, 6959–6964 (2008).

  5. Venkatesan, K. et al. An empirical framework for binary interactome mapping. Nat. Methods 6, 83–90 (2009).

  6. Edwards, A.M. et al. Too many roads not taken. Nature 470, 163–165 (2011).

  7. Braun, P. et al. An experimentally derived confidence score for binary protein-protein interactions. Nat. Methods 6, 91–97 (2009).

  8. Brückner, A., Polge, C., Lentze, N., Auerbach, D. & Schlattner, U. Yeast two-hybrid, a powerful tool for systems biology. Int. J. Mol. Sci. 10, 2763–2788 (2009).

  9. Wodak, S.J., Pu, S., Vlasblom, J. & Séraphin, B. Challenges and rewards of interaction proteomics. Mol. Cell. Proteomics 8, 3–18 (2009).

  10. Schwartz, A.S., Yu, J., Gardenour, K.R., Finley, R.L. Jr. & Ideker, T. Cost-effective strategies for completing the interactome. Nat. Methods 6, 55–61 (2009).

  11. Rhodes, D.R. et al. Probabilistic model of the human protein-protein interaction network. Nat. Biotechnol. 23, 951–959 (2005).

  12. Scott, M.S. & Barton, G.J. Probabilistic prediction and ranking of human protein-protein interactions. BMC Bioinformatics 8, 239 (2007).

  13. Kim, J.H. & Pearl, J. in Proc. IJCAI 190–193 (Morgan Kaufmann, 1983).

  14. Petschnigg, J. et al. The mammalian-membrane two-hybrid assay (MaMTH) for probing membrane-protein interactions in human cells. Nat. Methods 11, 585–592 (2014).

  15. Elefsinioti, A. et al. Large-scale de novo prediction of physical protein-protein association. Mol. Cell. Proteomics 10, M111.010629 (2011).

  16. Zhang, Q.C. et al. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature 490, 556–560 (2012).

  17. D'haeseleer, P. & Church, G.M. in Proc. IEEE Comput. Syst. Bioinform. Conf. 216–223 (IEEE, 2004).

  18. Kang, H.S. et al. NABP1, a novel RORγ-regulated gene encoding a single-stranded nucleic-acid-binding protein. Biochem. J. 397, 89–99 (2006).

  19. Krokeide, S.Z. et al. Human NEIL3 is mainly a monofunctional DNA glycosylase removing spiroimindiohydantoin and guanidinohydantoin. DNA Repair (Amst.) 12, 1159–1164 (2013).

  20. Menendez, D., Inga, A. & Resnick, M.A. The expanding universe of p53 targets. Nat. Rev. Cancer 9, 724–737 (2009).

  21. Wang, W. et al. Identification of rare DNA variants in mitochondrial disorders with improved array-based sequencing. Nucleic Acids Res. 39, 44–58 (2011).

  22. Vaseva, A.V. & Moll, U.M. The mitochondrial p53 pathway. Biochim. Biophys. Acta 1787, 414–420 (2009).

  23. Gordon, S., Akopyan, G., Garban, H. & Bonavida, B. Transcription factor YY1: structure, function, and therapeutic implications in cancer biology. Oncogene 25, 1125–1142 (2006).

  24. Tanikawa, C. et al. Regulation of protein citrullination through p53/PADI4 network in DNA damage response. Cancer Res. 69, 8761–8769 (2009).

  25. Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).

  26. Hunter, S. et al. InterPro: the integrative protein signature database. Nucleic Acids Res. 37, D211–D215 (2009).

  27. Imming, P., Sinning, C. & Meyer, A. Drugs, their targets and the nature and number of drug targets. Nat. Rev. Drug Discov. 5, 821–834 (2006).

  28. Su, A.I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl. Acad. Sci. USA 101, 6062–6067 (2004).

  29. Roth, R.B. et al. Gene expression analyses reveal molecular relationships among 20 regions of the human CNS. Neurogenetics 7, 67–80 (2006).

  30. Wang, E.T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008).

  31. Krupp, M. et al. RNA-Seq Atlas—a reference database for gene expression profiling in normal tissue by next-generation sequencing. Bioinformatics 28, 1184–1185 (2012).

  32. Uhlen, M. et al. Towards a knowledge-based Human Protein Atlas. Nat. Biotechnol. 28, 1248–1250 (2010).

  33. The UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 38, D142–D148 (2010).

  34. Brown, K.R. & Jurisica, I. Unequal evolutionary conservation of human protein interactions in interologous networks. Genome Biol. 8, R95 (2007).

  35. Piccinin, S. et al. A “twist box” code of p53 inactivation: twist box: p53 interaction promotes p53 degradation. Cancer Cell 22, 404–415 (2012).

  36. Hupp, T.R., Hayward, R.L. & Vojtesek, B. Strategies for p53 reactivation in human sarcoma. Cancer Cell 22, 283–285 (2012).

  37. Sprinzak, E. & Margalit, H. Correlated sequence-signatures as markers of protein-protein interaction. J. Mol. Biol. 311, 681–692 (2001).

  38. Zhang, Y. et al. Systematic analysis, comparison, and integration of disease based human genetic association data and mouse genetic phenotypic information. BMC Med. Genomics 3, 1 (2010).

  39. Osborne, J.D. et al. Annotating the human genome with Disease Ontology. BMC Genomics 10 (suppl. 1), S6 (2009).

  40. Davis, A.P. et al. The Comparative Toxicogenomics Database: update 2011. Nucleic Acids Res. 39, D1067–D1072 (2011).

  41. Kotlyar, M., Fortney, K. & Jurisica, I. Network-based characterization of drug-regulated genes, drug targets, and toxicity. Methods 57, 499–507 (2012).

  42. Maglott, D., Ostell, J., Pruitt, K.D. & Tatusova, T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 35, D26–D31 (2007).

  43. Hedges, S.B., Dudley, J. & Kumar, S. TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics 22, 2971–2972 (2006).

  44. Toll-Riera, M. et al. Origin of primate orphan genes: a comparative genomics approach. Mol. Biol. Evol. 26, 603–612 (2009).

  45. Barshir, R. et al. The TissueNet database of human tissue protein-protein interactions. Nucleic Acids Res. 41, D841–D844 (2013).

  46. Birzele, F., Gewehr, J.E. & Zimmer, R. AutoPSI: a database for automatic structural classification of protein sequences and structures. Nucleic Acids Res. 36, D398–D401 (2008).

  47. Ward, J.J., Sodhi, J.S., McGuffin, L.J., Buxton, B.F. & Jones, D.T. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J. Mol. Biol. 337, 635–645 (2004).

Acknowledgements

This research was supported by grants from Genome Canada via the Ontario Genomics Institute, Ontario Research Fund (GL2-01-030, RE-03-020 to I.J.), the Canadian Institutes for Health Research (#99745, #93579 to I.J., A.J.), the Natural Sciences Research Council (#203475 to I.J.), US Army Department of Defense W81XWH-12-1-0501 (to I.J.), the Italian Association for Cancer Research, the Friuli Venezia-Giulia and CRO 5xmille Intramural Grant (to R.M.), the Friuli Venezia-Giulia Exchange Program (to C.P.), the Ontario Genomics Institute (#303547 to I.S.), the Canadian Institutes of Health Research (Catalyst-NHG99091, PPP-125785 to I.S.), the Canadian Cystic Fibrosis Foundation (#300348 to I.S.), the Canadian Cancer Society (2010-700406 to I.S.), Genentech and University Health Network (GL2-01-018 to I.S.), US National Institutes of Health (NIH) PO1/PPG grant 01CA0099031 (to G.B.M., I.J.) and NCI R21 CA126700 (to Z.D., G.B.M.). Computational resources were supported by grants from the Canada Foundation for Innovation (CFI #12301, #203373, #29272, #22540a, #30865) and IBM (to I.J.). I.J. is supported by the Canada Research Chair program. This research was also supported by the University of Toronto McLaughlin Centre and the Ontario Ministry of Health and Long-Term Care (OMOHLTC). The views expressed do not necessarily reflect those of the OMOHLTC. We thank M. Vidal, D. Hill, F. Roth and the Center for Cancer Systems Biology (Dana-Farber Cancer Institute) for prepublication release of protein interaction data, funded by NIH NHGRI grant R01 HG001715.

Author information

Contributions

M.K. and I.J. conceived of the project. M.K. developed the algorithm and executed computational analyses and validation. Additional validation and assay-related analyses were executed by C.P., C.C., Y.N., F.V. and F.B.-C. R.M., I.S., A.J. and G.B.M. provided guidance for biological validation experiments that were executed by F.P., A.L.S., H.L., C.P., T.N. and Z.D. M.K. and I.J. wrote the initial manuscript, and all authors were involved in results presentation, discussion and preparation of the final manuscript.

Corresponding author

Correspondence to Igor Jurisica.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Estimating FDR of PPI predictions.

(a-b) We used the approach of D'Haeseleer and Church (ref. 1) to estimate FDR. This approach calculates the FDR of a PPI dataset, D, by analyzing intersections among three PPI datasets, D, R, and D′, where R is a reference set of trusted PPIs and D′ is a set of PPIs from a method similar to that of D. It is assumed that the overlap of any two datasets contains largely true positive PPIs. The number of non-overlapping true positives, IV, is calculated from the numbers of shared PPIs: IV = (II × III) / I. Then, the number of false positives, V, and the FDR are calculated. The FDR tends to be low if D has a high overlap with either D′ or R. (c) To calculate the FDR of FpClass, we initially set D to our top 35,000 proteome-wide predictions, excluding any PPIs used in training (we subsequently calculated FDR for larger sets of FpClass predictions; panels d-g). We defined R as a set of experimentally detected interactions and D′ as the union of high-confidence predictions from previous studies by Rhodes et al., 2005, Scott et al., 2007, Elefsinioti et al., 2011, and Zhang et al., 2012. Using a similar approach, we calculated FDRs for high-confidence predictions from these previous studies. For example, to calculate the FDR for Rhodes et al., we defined D as high-confidence predictions from that study, and D′ as the union of top FpClass predictions and high-confidence predictions from the three remaining previous studies. To ensure that estimated FDRs were not due to biases of a particular reference set, we repeated FDR calculations using six reference sets. We calculated FDRs using each reference set, except when the intersection of datasets D, D′, and R comprised fewer than 5 PPIs. In such cases the FDR is indicated as NA. (d-g) Using the approach of D'Haeseleer and Church, we estimated FDRs of predicted networks of various sizes from FpClass and four previous prediction methods. The approach of D'Haeseleer and Church requires a trusted reference set of PPIs. We tried four ways of defining this set: (d) using six reference sets (panel c) individually, and then calculating the median of the six resulting FDR estimates, (e) using the union of PPIs from methods that detect direct interactions (Y2H and LUMIER reference sets), (f) using the union of our six reference sets, and (g) using the union of Y2H reference sets.

  1. D’haeseleer, P. & Church, G. M. Estimating and improving protein interaction error rates. Proc IEEE Comput Syst Bioinform Conf 216–223 (2004).
  2. Rhodes, D. R. et al. Probabilistic model of the human protein-protein interaction network. Nat Biotechnol 23, 951–959 (2005).
  3. Scott, M. S. & Barton, G. J. Probabilistic prediction and ranking of human protein-protein interactions. BMC Bioinformatics 8, 239 (2007).
  4. Elefsinioti, A. et al. Large-scale de novo prediction of physical protein-protein association. Mol Cell Proteomics 10, M111.010629 (2011).
  5. Zhang, Q. C. et al. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature 490, 556–560 (2012).
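
The FDR calculation described in the caption of Supplementary Figure 1 can be illustrated with a minimal Python sketch. It is not the FpClass code. The caption does not spell out the regions I, II and III, so the sketch assumes that I is the overlap of all three datasets, II is the part of D shared only with R, and III is the part of D shared only with D′. It also assumes the FDR is V divided by the size of D. The function name `estimate_fdr` and its parameters are hypothetical.

```python
def estimate_fdr(d, r, d_prime, min_triple_overlap=5):
    """Estimate the FDR of PPI set `d` from its overlaps with `r` and `d_prime`.

    Each argument is an iterable of hashable PPI identifiers, for example
    frozenset({"TP53", "MDM2"}) so that protein order does not matter.
    Returns None (reported as NA in the caption) when the triple overlap
    holds fewer than `min_triple_overlap` PPIs.
    """
    d, r, d_prime = set(d), set(r), set(d_prime)

    # Assumed region definitions (not stated explicitly in the caption).
    n_i = len(d & r & d_prime)        # I: shared by all three datasets
    n_ii = len((d & r) - d_prime)     # II: shared by D and R only
    n_iii = len((d & d_prime) - r)    # III: shared by D and D' only

    if n_i < min_triple_overlap:
        return None

    # IV = (II x III) / I: true positives in D not supported by R or D'.
    n_iv = n_ii * n_iii / n_i

    # V: remaining PPIs in D, treated as false positives (clamped at zero,
    # which is a choice made for this sketch).
    n_v = max(0.0, len(d) - n_i - n_ii - n_iii - n_iv)

    return n_v / len(d)
```

With 100 predictions of which 10 fall in each of regions I, II and III, the sketch gives IV = 10, V = 60 and an FDR of 0.6.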

Supplementary Figure 2 Experimental validation of PPI predictions.

(a) Predicted interactions tested by Co-IP assays.

(b-c) Predicted interactions tested by GST pull-down assays.

(d) Predicted interaction partners of p53 include some of its known partners and d0 proteins. The x-axis indicates the number of top predicted partners, ranked from 1 to 2377. The y-axis indicates the number of known partners and d0 proteins, among the top predicted partners.

Supplementary Figure 3 Top Gene Ontology (GO) categories among d0 genes.

(a-c) GO analysis includes genes without GO annotations. (d-f) GO analysis excludes genes without GO annotations. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR.
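
As an illustration of the statistics named in this caption, the following Python sketch computes a one-sided hypergeometric enrichment P value and then adjusts a list of P values for multiple testing. The caption says only that P values were adjusted "using FDR"; the Benjamini-Hochberg procedure used here is an assumption, as are the function names and the choice of an upper-tail (over-representation) test.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when drawing `draws` genes without replacement from
    `population` genes, of which `successes` carry the annotation."""
    upper = min(successes, draws)
    numerator = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(population, draws)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

In this setting, `population` would be the number of genes considered, `successes` the number annotated with a given GO category, `draws` the number of d0 genes, and `k` the number of d0 genes in that category; one P value per category would then be passed to `benjamini_hochberg`.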

Supplementary Figure 4 Percentages of d0 proteins in drug-target classes and structural properties of d0 proteins.

(a) Main drug target classes and (b) receptor drug target classes, as defined by Imming et al. (ref. 6). Dashed lines indicate the percentage of d0 proteins in the proteome. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR. (c) SCOP structural classes. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR. (d) Protein lengths from UniProt (ref. 8) and (e) protein disorder, predicted with DISOPRED (ref. 9). P-values for protein length and disorder were calculated by two-sided Mann-Whitney U tests.

  6. Imming, P., Sinning, C. & Meyer, A. Drugs, their targets and the nature and number of drug targets. Nat Rev Drug Discov 5, 821–834 (2006).
  7. Andreeva, A. et al. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 36, D419–25 (2008).
  8. The UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38, D142–8 (2010).
  9. Ward, J. J., Sodhi, J. S., McGuffin, L. J., Buxton, B. F. & Jones, D. T. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 337, 635–645 (2004).

Supplementary Figure 5 Median and maximum expression of d0- and dk-encoding genes.

P-values were calculated by two-sided Mann-Whitney U tests. (a-d) Median expression of d0 and dk genes in healthy human tissues. Gene expression data was taken from (a) Su et al., 2004 (ref. 10), (b) Roth et al., 2006 (ref. 11), (c) Wang et al., 2008 (ref. 12), and (d) Krupp et al., 2012 (ref. 13). (e-h) Maximum expression of d0 and dk genes in the same datasets.

  10. Su, A. I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 101, 6062–6067 (2004).
  11. Roth, R. B. et al. Gene expression analyses reveal molecular relationships among 20 regions of the human CNS. Neurogenetics 7, 67–80 (2006).
  12. Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008).
  13. Krupp, M. et al. RNA-Seq Atlas--a reference database for gene expression profiling in normal tissue by next-generation sequencing. Bioinformatics 28, 1184–1185 (2012).

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–5, Supplementary Tables 1–10 and Supplementary Note (PDF 2171 kb)

Supplementary Software

FpClass code (ZIP 239668 kb)

Supplementary Data 1

Positive cases in our largest training set. (TXT 279 kb)

Supplementary Data 2

Predicted probabilities for protein pairs from Braun et al., 2009 (TXT 8 kb)

Supplementary Data 3

Cross-validation data: protein pairs and predicted probabilities. (TXT 48197 kb)

Supplementary Data 4

Experimentally tested predictions. (TXT 12 kb)

Supplementary Data 5

Fp60 network: predicted interactions with estimated FDR of 60%. (TXT 8784 kb)

Supplementary Data 6

D0 proteins: human proteins without experimentally detected interactions in I2D 1.95. (TXT 49 kb)

Cite this article

Kotlyar, M., Pastrello, C., Pivetta, F. et al. In silico prediction of physical protein interactions and characterization of interactome orphans. Nat Methods 12, 79–84 (2015). https://doi.org/10.1038/nmeth.3178

  • DOI: https://doi.org/10.1038/nmeth.3178
