Abstract
When the standard approach of predicting protein function by sequence homology fails, alternative methods that require only the amino acid sequence can be used to predict function. One such approach uses machine learning to predict protein function directly from amino acid sequence features. However, two issues must be addressed before successful functional prediction can take place: identifying discriminatory features, and overcoming the large class imbalance in the training data. We show that applying feature subset selection followed by undersampling of the majority class generates significantly better support vector machine (SVM) classifiers than standard machine learning approaches. As well as revealing that the selected features have the potential to advance our understanding of the relationship between sequence and function, we also show that undersampling to produce fully balanced data significantly improves performance. The best discriminating ability is achieved using SVMs together with feature selection and full undersampling; this approach strongly outperforms other competitive learning algorithms. We conclude that this combined approach can generate powerful machine learning classifiers for predicting protein function directly from sequence.
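The two preprocessing steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the synthetic data, the random-undersampling helper and the t-statistic filter score are all assumptions introduced for demonstration; the paper's actual pipeline uses sequence-derived features and an SVM trained on the balanced, reduced feature set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: 1000 proteins outside the functional
# class (majority) vs 50 inside it (minority), 20 sequence-derived features.
X_maj = rng.normal(0.0, 1.0, size=(1000, 20))
X_min = rng.normal(0.5, 1.0, size=(50, 20))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 1000 + [1] * 50)

def undersample_to_balance(X, y, rng):
    """Randomly discard majority-class examples until both classes are equal."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    keep = rng.choice(majority, size=minority.size, replace=False)
    idx = np.concatenate([keep, minority])
    rng.shuffle(idx)
    return X[idx], y[idx]

def filter_select(X, y, k):
    """Rank features by a simple two-sample t-like separation score; keep top k."""
    a, b = X[y == 1], X[y == 0]
    score = np.abs(a.mean(0) - b.mean(0)) / np.sqrt(
        a.var(0) / len(a) + b.var(0) / len(b)
    )
    return np.argsort(score)[::-1][:k]

# Fully balanced training set (50 positives, 50 negatives) ...
Xb, yb = undersample_to_balance(X, y, rng)
# ... restricted to the 5 most discriminatory features.
top = filter_select(Xb, yb, k=5)
Xb_sel = Xb[:, top]
print(int(yb.sum()), len(yb))   # prints "50 100": fully balanced
print(Xb_sel.shape)             # prints "(100, 5)"
```

An SVM (e.g. with a sequential minimal optimisation trainer) would then be fitted on `Xb_sel`; the order of the two steps, selection before undersampling or after, is a design choice, and the abstract's wording ("feature subset selection followed by undersampling") suggests selection comes first in the original study.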
Acknowledgments
We thank Mark Girolami, Aik Choon Tan and Simon Rogers for their comments and feedback throughout this research. Ali Al-Shahib is funded by The University of Glasgow and Rainer Breitling is supported by a BBSRC grant (17/GG17989).
The authors have no conflicts of interest that are directly relevant to the content of this article.
Cite this article
Al-Shahib, A., Breitling, R. & Gilbert, D. Feature Selection and the Class Imbalance Problem in Predicting Protein Function from Sequence. Appl Bioinformatics 4, 195–203 (2005). https://doi.org/10.2165/00822942-200504030-00004