Feature Selection and the Class Imbalance Problem in Predicting Protein Function from Sequence

  • Original Research Article
Applied Bioinformatics

Abstract

When the standard approach of predicting protein function by sequence homology fails, alternative methods that require only the amino acid sequence can be used. One such approach uses machine learning to predict protein function directly from amino acid sequence features. However, two issues must be addressed before function can be predicted successfully: identifying discriminatory features, and overcoming a large class imbalance in the training data. We show that applying feature subset selection followed by undersampling of the majority class generates significantly better support vector machine (SVM) classifiers than standard machine learning approaches. The selected features have the potential to advance our understanding of the relationship between sequence and function, and undersampling to produce fully balanced data significantly improves performance. The best discriminating ability is achieved using SVMs together with feature selection and full undersampling; this approach strongly outperforms other competitive learning algorithms. We conclude that this combined approach can generate powerful machine learning classifiers for predicting protein function directly from sequence.
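To make the idea of "full undersampling" concrete, the following is a minimal sketch of random undersampling of the majority class to a fully balanced training set. It is an illustration only, not the authors' implementation: the function name `undersample_majority`, the fixed seed and the toy data are all assumptions, and the abstract does not state how the discarded majority examples were chosen.

```python
import random
from collections import Counter


def undersample_majority(samples, labels, seed=0):
    """Randomly discard majority-class examples until both classes are equal in size.

    Hypothetical helper for illustration; assumes a two-class problem.
    """
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    minority_size = min(counts.values())

    # Group example indices by class label.
    indices_by_class = {}
    for index, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(index)

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    kept = []
    for label, indices in indices_by_class.items():
        if len(indices) > minority_size:
            # Majority class: keep a random subset the size of the minority class.
            indices = rng.sample(indices, minority_size)
        kept.extend(indices)

    kept.sort()  # preserve the original ordering of the retained examples
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy data (assumed): 3 positive examples and 9 negative examples.
features = [[float(i), float(i % 3)] for i in range(12)]
labels = [1, 1, 1] + [0] * 9

balanced_features, balanced_labels = undersample_majority(features, labels)
print(Counter(balanced_labels))
```

In a pipeline of the kind the abstract describes, a step like this would be applied to the training data only, after the feature subset has been chosen and before the classifier is trained; the test data would keep its original class distribution.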

Acknowledgments

We thank Mark Girolami, Aik Choon Tan and Simon Rogers for their comments and feedback throughout this research. Ali Al-Shahib is funded by The University of Glasgow and Rainer Breitling is supported by a BBSRC grant (17/GG17989).

The authors have no conflicts of interest that are directly relevant to the content of this article.

Author information

Correspondence to Ali Al-Shahib.

About this article

Cite this article

Al-Shahib, A., Breitling, R. & Gilbert, D. Feature Selection and the Class Imbalance Problem in Predicting Protein Function from Sequence. Appl Bioinformatics 4, 195–203 (2005). https://doi.org/10.2165/00822942-200504030-00004

  • DOI: https://doi.org/10.2165/00822942-200504030-00004
