
2020 | OriginalPaper | Chapter

Graph Based Automatic Protein Function Annotation Improved by Semantic Similarity

Authors: Bishnu Sarker, Navya Khare, Marie-Dominique Devignes, Sabeur Aridhi

Published in: Bioinformatics and Biomedical Engineering

Publisher: Springer International Publishing


Abstract

Functional annotation of proteins is a challenging task: manual annotation requires a great deal of human effort, and even so it is nearly impossible to keep pace with the exponentially growing number of protein sequences entering public databases as a result of high-throughput sequencing technology. For example, the UniProt Knowledgebase (UniProtKB) is currently the largest and most comprehensive resource for protein sequence and annotation data. According to the November 2019 release of UniProtKB, some 561,000 sequences have been manually reviewed, but over 150 million sequences lack reviewed functional annotations. Manual annotation is also expensive in both cost and time. Yet exploiting this huge quantity of data is important for understanding life at the molecular level, and is central to understanding human disease processes and to drug discovery. To be useful, protein sequences need to be annotated with functional properties such as Enzyme Commission (EC) numbers and Gene Ontology (GO) terms. The ability to automatically annotate protein sequences in UniProtKB/TrEMBL, the non-reviewed UniProt sequence repository, would represent a major step towards bridging the gap between annotated and unannotated protein sequences. In this paper, we extend a neighborhood-based network inference technique for automatic GO annotation using a protein similarity graph built on protein domain and family information. Our approach assumes that proteins can be linked through the domains, families, and superfamilies that they share. We propose an efficient pruning and post-processing technique that integrates the semantic similarity of GO terms. We show empirically that the proposed hierarchical post-processing can potentially improve the performance of other GO annotation tools as well.
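The general idea of neighborhood-based inference on a domain-sharing graph can be illustrated with a minimal Python sketch. It is not the chapter's implementation. The protein names, signature IDs, GO term IDs, the Jaccard edge weight, the 0.5 score threshold, and the "keep the most specific term" pruning step are all assumptions made for illustration. In particular, the pruning step is a simple hierarchy-based stand-in for the semantic-similarity post-processing, whose details are not given in the abstract.

```python
from collections import defaultdict

# Toy data (hypothetical): domain/family signatures per protein.
domains = {
    "P1": {"IPR001", "IPR002"},
    "P2": {"IPR001", "IPR002", "IPR003"},
    "P3": {"IPR003"},
    "Q": {"IPR001", "IPR002"},  # unannotated query protein
}
# Known GO annotations of reviewed proteins (hypothetical term IDs).
annotations = {
    "P1": {"GO:A", "GO:B"},
    "P2": {"GO:B"},
    "P3": {"GO:C"},
}
# Toy GO hierarchy, child -> set of parents (hypothetical).
parents = {"GO:B": {"GO:A"}, "GO:A": set(), "GO:C": set()}


def jaccard(a, b):
    """Edge weight (assumed): fraction of signatures shared by two proteins."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict(query, domains, annotations, threshold=0.5):
    """Score each GO term by the normalised weight of the neighbours carrying it."""
    scores = defaultdict(float)
    total = 0.0
    for prot, terms in annotations.items():
        w = jaccard(domains[query], domains[prot])
        if w == 0.0:
            continue  # no shared signature, so no edge in the graph
        total += w
        for term in terms:
            scores[term] += w
    if total == 0.0:
        return {}
    return {t: s / total for t, s in scores.items() if s / total >= threshold}


def ancestors(term, parents):
    """Collect all ancestors of a term in the toy hierarchy."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, ()))
    return seen


def most_specific(predicted, parents):
    """Drop every predicted term that is an ancestor of another predicted term."""
    redundant = set()
    for term in predicted:
        redundant |= ancestors(term, parents) & set(predicted)
    return {t: s for t, s in predicted.items() if t not in redundant}


if __name__ == "__main__":
    raw = predict("Q", domains, annotations)
    print(raw)
    print(most_specific(raw, parents))
```

In this toy graph the query Q shares signatures with P1 and P2 but not with P3, so only the terms carried by P1 and P2 receive a score, and the pruning step then keeps the more specific of the two related terms.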


Metadata

Title: Graph Based Automatic Protein Function Annotation Improved by Semantic Similarity
Authors: Bishnu Sarker, Navya Khare, Marie-Dominique Devignes, Sabeur Aridhi
Copyright Year: 2020
DOI: https://doi.org/10.1007/978-3-030-45385-5_24
