Abstract
When the standard approach of predicting protein function by sequence homology fails, alternative methods that require only the amino acid sequence can be used to predict function. One such approach uses machine learning to predict protein function directly from amino acid sequence features. However, two issues must be addressed before successful functional prediction can take place: identifying discriminatory features, and overcoming the large class imbalance in the training data. We show that applying feature subset selection followed by undersampling of the majority class generates significantly better support vector machine (SVM) classifiers than standard machine learning approaches. As well as revealing that the selected features have the potential to advance our understanding of the relationship between sequence and function, we also show that undersampling to produce fully balanced data significantly improves performance. The best discriminating ability is achieved using SVMs together with feature selection and full undersampling; this approach strongly outperforms other competitive learning algorithms. We conclude that this combined approach can generate powerful machine learning classifiers for predicting protein function directly from sequence.
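The two preprocessing steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the synthetic data, the random-undersampling helper and the t-statistic filter score are all assumptions introduced for demonstration; the paper's actual pipeline uses sequence-derived features and an SVM trained on the balanced, reduced feature set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: 1000 proteins outside the functional
# class (majority) vs 50 inside it (minority), 20 sequence-derived features.
X_maj = rng.normal(0.0, 1.0, size=(1000, 20))
X_min = rng.normal(0.5, 1.0, size=(50, 20))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 1000 + [1] * 50)

def undersample_to_balance(X, y, rng):
    """Randomly discard majority-class examples until both classes are equal."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    keep = rng.choice(majority, size=minority.size, replace=False)
    idx = np.concatenate([keep, minority])
    rng.shuffle(idx)
    return X[idx], y[idx]

def filter_select(X, y, k):
    """Rank features by a simple two-sample t-like separation score; keep top k."""
    a, b = X[y == 1], X[y == 0]
    score = np.abs(a.mean(0) - b.mean(0)) / np.sqrt(
        a.var(0) / len(a) + b.var(0) / len(b)
    )
    return np.argsort(score)[::-1][:k]

# Fully balanced training set (50 positives, 50 negatives) ...
Xb, yb = undersample_to_balance(X, y, rng)
# ... restricted to the 5 most discriminatory features.
top = filter_select(Xb, yb, k=5)
Xb_sel = Xb[:, top]
print(int(yb.sum()), len(yb))   # prints "50 100": fully balanced
print(Xb_sel.shape)             # prints "(100, 5)"
```

An SVM (e.g. with a sequential minimal optimisation trainer) would then be fitted on `Xb_sel`; the order of the two steps, selection before undersampling or after, is a design choice, and the abstract's wording ("feature subset selection followed by undersampling") suggests selection comes first in the original study.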
Acknowledgments
We thank Mark Girolami, Aik Choon Tan and Simon Rogers for their comments and feedback throughout this research. Ali Al-Shahib is funded by The University of Glasgow and Rainer Breitling is supported by a BBSRC grant (17/GG17989).
The authors have no conflicts of interest that are directly relevant to the content of this article.
Cite this article
Al-Shahib, A., Breitling, R. & Gilbert, D. Feature Selection and the Class Imbalance Problem in Predicting Protein Function from Sequence. Appl Bioinformatics 4, 195–203 (2005). https://doi.org/10.2165/00822942-200504030-00004