2016 | OriginalPaper | Book chapter

Measuring the Stability of Feature Selection

Authors: Sarah Nogueira, Gavin Brown

Published in: Machine Learning and Knowledge Discovery in Databases

Publisher: Springer International Publishing

Abstract

In feature selection algorithms, “stability” is the sensitivity of the chosen feature set to variations in the supplied training data. As such, it can be seen as analogous to the statistical variance of a predictor. However, unlike variance, stability has no unique definition, and numerous measures have been proposed over 15 years of literature. In this paper, instead of defining a new measure, we start from an axiomatic point of view and identify what properties would be desirable. Somewhat surprisingly, we find that the simple Pearson’s correlation coefficient has all necessary properties, yet has somehow been overlooked in favour of more complex alternatives. Finally, we illustrate how the use of this measure in practice can provide better interpretability and more confidence in the model selection process. The data and software related to this paper are available at https://github.com/nogueirs/ECML2016.
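
As a rough illustration of the idea in the abstract, the sketch below encodes each selected feature set as a binary vector (1 if a feature is chosen, 0 otherwise) and averages Pearson's correlation over all pairs of vectors. The averaging over pairs, the function names `pearson` and `pearson_stability`, and the toy data are assumptions made for illustration; they are not taken from the paper or from the authors' Matlab code.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson's correlation between two equal-length numeric vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A vector that selects no features or all features has zero variance,
        # so the correlation is undefined for it.
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_a * var_b)


def pearson_stability(selections):
    """Average pairwise Pearson correlation over binary selection vectors.

    `selections` is a list of equal-length 0/1 lists, one per training sample
    (hypothetical input format chosen for this sketch).
    """
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two feature sets")
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy example: three runs of a selector over d = 6 features.
    runs = [
        [1, 1, 1, 0, 0, 0],
        [1, 1, 0, 1, 0, 0],
        [1, 1, 1, 0, 0, 0],
    ]
    print(pearson_stability(runs))
```

Under this sketch, identical selections across all runs give a value of 1, and two complementary selections give -1. How constant vectors should be handled is a design choice; here they are simply rejected.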

Footnotes

1. We therefore have a set of d correlated Bernoulli variables \((X_1, \ldots, X_d)\).

2. \(\phi\) is not necessarily symmetric.

3. Sketches of proofs are given in the supplementary material available online at www.cs.man.ac.uk/~nogueirs/files/supplementary-material-ECML-2016.pdf.

4. Also called the Phi coefficient in this case, since we are dealing with binary vectors.

5. You can reproduce these experiments in Matlab with the code given at https://github.com/nogueirs/ECML2016.

6. Here, the error is taken to be the negative log-likelihood, a measure of goodness-of-fit of the model. The lower the value, the better the model.

Metadata
Title: Measuring the Stability of Feature Selection
Authors: Sarah Nogueira, Gavin Brown
Copyright year: 2016
DOI: https://doi.org/10.1007/978-3-319-46227-1_28