Published in: Annals of Data Science 6/2022

25.03.2020

Monotonicity of the \(\chi ^2\)-statistic and Feature Selection

Authors: Firuz Kamalov, Ho Hon Leung, Sherif Moussa



Abstract

Feature selection is an important preprocessing step in analyzing large-scale data. In this paper, we prove the monotonicity property of the \(\chi ^2\)-statistic and use it to construct a more robust feature selection method. In particular, we show that \(\chi ^2_{Y, X_1} \le \chi ^2_{Y, (X_1, X_2)}\). This result implies that a new feature should be added to an existing feature set only if it increases the \(\chi ^2\)-statistic beyond a certain threshold. Our stepwise feature selection algorithm significantly reduces the number of features considered at each stage, making it more efficient than other similar methods. In addition, the selection process has a natural stopping point, eliminating the need for user input. Numerical experiments confirm that the proposed algorithm can significantly reduce the number of features required for classification and improve classifier accuracy.
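The selection rule in the abstract — add a feature only if it raises the joint \(\chi ^2\)-statistic beyond a threshold, and stop otherwise — can be sketched as follows. This is a minimal illustration, not the authors' exact algorithm: the function names `chi2_stat` and `stepwise_select` and the threshold value (3.84, the 5% critical value of \(\chi ^2\) with one degree of freedom) are assumptions for the example, and the joint statistic is computed by treating a set of categorical features as a single product-space variable.

```python
import numpy as np
from scipy.stats import chi2_contingency

def chi2_stat(y, X):
    """Chi-square statistic between labels y and the joint feature
    formed by the columns of X (categorical features assumed)."""
    X = np.atleast_2d(X)
    # Encode each sample's feature values as one joint categorical value.
    joint = [tuple(row) for row in X]
    x_codes = {c: i for i, c in enumerate(sorted(set(joint)))}
    y_codes = {c: i for i, c in enumerate(sorted(set(y)))}
    # Build the contingency table of y against the joint feature.
    table = np.zeros((len(y_codes), len(x_codes)))
    for yi, xi in zip(y, joint):
        table[y_codes[yi], x_codes[xi]] += 1
    stat, _, _, _ = chi2_contingency(table, correction=False)
    return stat

def stepwise_select(y, X, threshold=3.84):
    """Greedy forward selection: at each step add the feature with the
    largest chi-square gain; stop when no gain exceeds the threshold."""
    remaining = list(range(X.shape[1]))
    selected, current = [], 0.0
    while remaining:
        gains = [(chi2_stat(y, X[:, selected + [j]]) - current, j)
                 for j in remaining]
        best_gain, best_j = max(gains)
        if best_gain <= threshold:
            break  # monotonicity guarantees gains are nonnegative; too small to keep
        selected.append(best_j)
        remaining.remove(best_j)
        current += best_gain
    return selected
```

Because \(\chi ^2_{Y, S} \le \chi ^2_{Y, S \cup \{X\}}\), every candidate gain is nonnegative, so the threshold test is the natural stopping point the abstract refers to: once a feature that perfectly separates the classes is in the set, all further gains are zero and selection terminates without user input.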


Footnotes
1
All the experiments are performed using the scikit-learn library in Python [22].
 
References
1. Bryant F, Satorra A (2012) Principles and practice of scaled difference chi-square testing. Struct Equ Model Multidiscip J 19(3):372–398
2. Brown G et al (2012) Conditional likelihood maximization: a unifying framework for information theoretic feature selection. J Mach Learn Res 13:27–66
3. Buza K (2014) Feedback prediction for blogs. In: Spiliopoulou M, Schmidt-Thieme L, Janning R (eds) Machine learning and knowledge discovery, data analysis. Springer, Cham, pp 145–152
4. Cios KJ, Kurgan LA (2001) SPECT heart data set. UCI machine learning repository. School of Information and Computer Science, University of California, Irvine, CA
5. Dash M, Liu H (2003) Consistency-based search in feature selection. Artif Intell 151(1–2):155–176
6. Franke TM, Ho T, Christie CA (2011) The chi-square test. Am J Eval 33(3):448–458
7. Guyon I (2003) Design of experiments for the NIPS 2003 variable selection benchmark
8. Haddi E, Liu X, Shi Y (2013) The role of text pre-processing in sentiment analysis. Procedia Comput Sci 17:26–32
9. Harder M, Salge C, Polani D (2013) Bivariate measure of redundant information. Phys Rev 87(1):012130
10. Hancer E, Xue B, Zhang M (2018) Differential evolution for filter feature selection based on information theory and feature ranking. Knowl Based Syst 140:103–119
11. Jin X et al (2006) Machine learning techniques and chi-square feature selection for cancer classification using SAGE gene expression profiles. In: International workshop on data mining for biomedical applications. Springer, Berlin
12. Kamalov F, Thabtah F (2017) A feature selection method based on ranked vector scores of features for classification. Ann Data Sci 4:1–20
13. Kamalov F (2018) Sensitivity analysis for feature selection. In: 2018 17th IEEE international conference on machine learning and applications (ICMLA). IEEE, pp 1466–1470
14. Kamalov F, Thabtah F (2020) Outlier detection in high dimensional data. J Inf Knowl Manag 19(1):2040013
15. Khoshgoftaar T, Gao K, Napolitano A, Wald R (2014) A comparative study of iterative and non-iterative feature selection techniques for software defect prediction. Inf Syst Front 16(5):801–822
16. Kononenko I, Cestnik B (1988) UCI machine learning repository. School of Information and Computer Science, University of California, Irvine, CA
17. Li C, Xu J (2019) Feature selection with the Fisher score followed by the Maximal Clique Centrality algorithm can accurately identify the hub genes of hepatocellular carcinoma. Sci Rep 9:17283
18. Li Y, Luo C, Chung SM (2008) Text clustering with feature selection by using statistical data. IEEE Trans Knowl Data Eng 20(5):641–652
19. Liu H, Motoda H (eds) (2007) Computational methods of feature selection. CRC Press, Boca Raton
20. Moh'd A, Mesleh A (2007) Chi square feature extraction based SVMs Arabic language text categorization system. J Comput Sci 3(6):430–435
21. Nene SA, Nayar SK, Murase H (1996) Columbia object image library (COIL-20)
22. Pedregosa F et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
23. Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238
24. Satorra A, Bentler PM (2010) Ensuring positiveness of the scaled difference chi-square test statistic. Psychometrika 75(2):243–248
25. Schlimmer J (1987) Mushroom data set. UCI machine learning repository. School of Information and Computer Science, University of California, Irvine, CA
26. Schlimmer J (1987) Congressional voting records data set. UCI machine learning repository. School of Information and Computer Science, University of California, Irvine, CA
27. Sun L, Zhang XY, Qian YH, Xu JC, Zhang SG, Tian Y (2019) Joint neighborhood entropy-based gene selection method with Fisher score for tumor classification. Appl Intell 49(4):1245–1259
28. Tang B, Kay S, He H (2016) Toward optimal feature selection in naive Bayes for text categorization. IEEE Trans Knowl Data Eng 28(9):2508–2521
29. Thabtah F, Kamalov F (2017) Phishing detection: a case analysis on classifiers with rules using machine learning. J Inf Knowl Manag 16(4):1–16
30. Thabtah F, Kamalov F, Rajab K (2018) A new computational intelligence approach to detect autistic features for autism screening. Int J Med Inform 117:112–124
31. Uysal AK, Gunal S (2014) The impact of preprocessing on text classification. Inf Process Manag 50(1):104–112
32. Urbanowicz RJ, Meeker M, La Cava W, Olson RS, Moore JH (2018) Relief-based feature selection: introduction and review. J Biomed Inform 85:189–203
33. Urbanowicz RJ, Olson RS, Schmitt P, Meeker M, Moore JH (2018) Benchmarking relief-based feature selection methods for bioinformatics data mining. J Biomed Inform 85:168–188
34. Voiculescu D (1993) The analogues of entropy and of Fisher's information measure in free probability theory, I. Commun Math Phys 155(1):71–92
35. Wang Y, Ni S, Priestley J (2019) Improving risk modeling via feature selection, hyper-parameter adjusting, and model ensembling. Glob J Econ Finance 3(1):30–47
36. Wolberg WH, Street N, Mangasarian OL (1995) Breast Cancer Wisconsin (diagnostic) data set. UCI machine learning repository. School of Information and Computer Science, University of California, Irvine, CA
37. Williams PL (2011) Information dynamics: its theory and application to embodied cognitive systems. Ph.D. thesis, Indiana University
38. Yu L, Liu H (2004) Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res 5:1205–1224
Metadata
Title
Monotonicity of the \(\chi ^2\)-statistic and Feature Selection
Authors
Firuz Kamalov
Ho Hon Leung
Sherif Moussa
Publication date
25.03.2020
Publisher
Springer Berlin Heidelberg
Published in
Annals of Data Science / Issue 6/2022
Print ISSN: 2198-5804
Electronic ISSN: 2198-5812
DOI
https://doi.org/10.1007/s40745-020-00251-7
