
2015 | Original Paper | Book Chapter

PDP-RF: Protein Domain Boundary Prediction Using Random Forest Classifier

Authors: Piyali Chatterjee, Subhadip Basu, Julian Zubek, Mahantapas Kundu, Mita Nasipuri, Dariusz Plewczynski

Published in: Pattern Recognition and Machine Intelligence

Publisher: Springer International Publishing

Abstract

Domain boundary prediction is a crucial task for the functional classification of proteins, homology-based protein structure prediction, and high-throughput structural genomics. In the proposed method, each amino acid is represented by a set of physico-chemical properties. A Random Forest classifier is explored for accurate prediction of domain regions, trained on a curated dataset obtained from the CATH database. The software is tested on CASP-6, CASP-8, CASP-9 and CASP-10 target proteins, and its prediction accuracy is evaluated using three-fold cross-validation experiments. Finally, a consensus approach combines the results of the classifiers obtained through the cross-validation experiments. The resulting consensus-based Random Forest classifiers (PDP-RF) achieve average recall and precision scores of 0.98 and 0.88, respectively, on the CASP targets. The overall accuracy and F-score of PDP-RF are 0.87 and 0.91, respectively.
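The abstract does not say how the consensus over the cross-validation classifiers is formed, so the following is only a minimal sketch under stated assumptions: each of the three fold classifiers emits a per-residue binary label (1 = domain-region residue, 0 = boundary residue), and the consensus is a simple majority vote. The function names `consensus_vote` and `binary_metrics` and the toy labels are hypothetical and are not taken from the paper. The sketch also shows how recall, precision, accuracy and F-score are computed from such per-residue labels.

```python
from typing import Dict, List, Sequence


def consensus_vote(fold_predictions: Sequence[Sequence[int]],
                   min_votes: int = 2) -> List[int]:
    """Combine per-residue 0/1 predictions from several classifiers.

    A residue is labelled 1 when at least `min_votes` classifiers predict 1.
    Majority voting is an assumption; the paper's exact rule is not given here.
    """
    n = len(fold_predictions[0])
    if any(len(p) != n for p in fold_predictions):
        raise ValueError("all prediction vectors must have the same length")
    return [1 if sum(p[i] for p in fold_predictions) >= min_votes else 0
            for i in range(n)]


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute precision, recall, F-score and accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy}


# Made-up labels for a 10-residue toy sequence (illustration only).
y_true = [1, 1, 1, 0, 0, 1, 1, 1, 1, 0]
fold_a = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
fold_b = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
fold_c = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0]

consensus = consensus_vote([fold_a, fold_b, fold_c])
metrics = binary_metrics(y_true, consensus)
print(consensus)
print(metrics)
```

In this toy case the vote removes the isolated errors that only one classifier makes and keeps the one error shared by two classifiers. As a side note, inserting the reported average recall (0.98) and precision (0.88) into the same F-score formula gives about 0.93, slightly above the reported 0.91. One possible explanation is that an average of per-target F-scores need not equal the F-score of the averaged precision and recall.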

Metadata
Title
PDP-RF: Protein Domain Boundary Prediction Using Random Forest Classifier
Authors
Piyali Chatterjee
Subhadip Basu
Julian Zubek
Mahantapas Kundu
Mita Nasipuri
Dariusz Plewczynski
Copyright Year
2015
DOI
https://doi.org/10.1007/978-3-319-19941-2_42