Published in: Annals of Data Science 6/2022

25.03.2020

Monotonicity of the \(\chi ^2\)-statistic and Feature Selection

Authors: Firuz Kamalov, Ho Hon Leung, Sherif Moussa



Abstract

Feature selection is an important preprocessing step in analyzing large-scale data. In this paper, we prove the monotonicity property of the \(\chi ^2\)-statistic and use it to construct a more robust feature selection method. In particular, we show that \(\chi ^2_{Y, X_1} \le \chi ^2_{Y, (X_1, X_2)}\). This result implies that a new feature should be added to an existing feature set only if it increases the \(\chi ^2\)-statistic beyond a certain threshold. Our stepwise feature selection algorithm significantly reduces the number of features considered at each stage, making it more efficient than other similar methods. In addition, the selection process has a natural stopping point, eliminating the need for user input. Numerical experiments confirm that the proposed algorithm can significantly reduce the number of features required for classification and improve classifier accuracy.
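The selection rule in the abstract — add a feature only if it raises the joint \(\chi ^2\)-statistic beyond a threshold, and stop otherwise — can be sketched as follows. This is a minimal illustration, not the authors' exact algorithm: the function names `chi2_stat` and `stepwise_select` and the threshold value (3.84, the 5% critical value of \(\chi ^2\) with one degree of freedom) are assumptions for the example, and the joint statistic is computed by treating a set of categorical features as a single product-space variable.

```python
import numpy as np
from scipy.stats import chi2_contingency

def chi2_stat(y, X):
    """Chi-square statistic between labels y and the joint feature
    formed by the columns of X (categorical features assumed)."""
    X = np.atleast_2d(X)
    # Encode each sample's feature values as one joint categorical value.
    joint = [tuple(row) for row in X]
    x_codes = {c: i for i, c in enumerate(sorted(set(joint)))}
    y_codes = {c: i for i, c in enumerate(sorted(set(y)))}
    # Build the contingency table of y against the joint feature.
    table = np.zeros((len(y_codes), len(x_codes)))
    for yi, xi in zip(y, joint):
        table[y_codes[yi], x_codes[xi]] += 1
    stat, _, _, _ = chi2_contingency(table, correction=False)
    return stat

def stepwise_select(y, X, threshold=3.84):
    """Greedy forward selection: at each step add the feature with the
    largest chi-square gain; stop when no gain exceeds the threshold."""
    remaining = list(range(X.shape[1]))
    selected, current = [], 0.0
    while remaining:
        gains = [(chi2_stat(y, X[:, selected + [j]]) - current, j)
                 for j in remaining]
        best_gain, best_j = max(gains)
        if best_gain <= threshold:
            break  # monotonicity guarantees gains are nonnegative; too small to keep
        selected.append(best_j)
        remaining.remove(best_j)
        current += best_gain
    return selected
```

Because \(\chi ^2_{Y, S} \le \chi ^2_{Y, S \cup \{X\}}\), every candidate gain is nonnegative, so the threshold test is the natural stopping point the abstract refers to: once a feature that perfectly separates the classes is in the set, all further gains are zero and selection terminates without user input.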


Footnotes
1
All the experiments are performed using the scikit-learn library in Python [22].
 
References
1. Bryant F, Satorra A (2012) Principles and practice of scaled difference chi-square testing. Struct Equ Model Multidiscip J 19(3):372–398
2. Brown G et al (2012) Conditional likelihood maximization: a unifying framework for information theoretic feature selection. J Mach Learn Res 13:27–66
3. Buza K (2014) Feedback prediction for blogs. In: Spiliopoulou M, Schmidt-Thieme L, Janning R (eds) Machine learning and knowledge discovery, data analysis. Springer, Cham, pp 145–152
4. Cios KJ, Kurgan LA (2001) SPECT heart data set. UCI machine learning repository. School of Information and Computer Science, University of California, Irvine, CA
5. Dash M, Liu H (2003) Consistency-based search in feature selection. Artif Intell 151(1–2):155–176
6. Franke TM, Ho T, Christie CA (2011) The chi-square test. Am J Eval 33(3):448–458
7. Guyon I (2003) Design of experiments for the NIPS 2003 variable selection benchmark
8. Haddi E, Liu X, Shi Y (2013) The role of text pre-processing in sentiment analysis. Procedia Comput Sci 17:26–32
9. Harder M, Salge C, Polani D (2013) Bivariate measure of redundant information. Phys Rev 87(1):012130
10. Hancer E, Xue B, Zhang M (2018) Differential evolution for filter feature selection based on information theory and feature ranking. Knowl Based Syst 140:103–119
11. Jin X et al (2006) Machine learning techniques and chi-square feature selection for cancer classification using SAGE gene expression profiles. In: International workshop on data mining for biomedical applications. Springer, Berlin
12. Kamalov F, Thabtah F (2017) A feature selection method based on ranked vector scores of features for classification. Ann Data Sci 4:1–20
13. Kamalov F (2018) Sensitivity analysis for feature selection. In: 2018 17th IEEE international conference on machine learning and applications (ICMLA). IEEE, pp 1466–1470
14. Kamalov F, Thabtah F (2020) Outlier detection in high dimensional data. J Inf Knowl Manag 19(1):2040013
15. Khoshgoftaar T, Gao K, Napolitano A, Wald R (2014) A comparative study of iterative and non-iterative feature selection techniques for software defect prediction. Inf Syst Front 16(5):801–822
16. Kononenko I, Cestnik B (1988) UCI machine learning repository. School of Information and Computer Science, University of California, Irvine, CA
17. Li C, Xu J (2019) Feature selection with the Fisher score followed by the Maximal Clique Centrality algorithm can accurately identify the hub genes of hepatocellular carcinoma. Sci Rep 9:17283
18. Li Y, Luo C, Chung SM (2008) Text clustering with feature selection by using statistical data. IEEE Trans Knowl Data Eng 20(5):641–652
19. Liu H, Motoda H (eds) (2007) Computational methods of feature selection. CRC Press, Boca Raton
20. Moh'd A, Mesleh A (2007) Chi square feature extraction based SVMs Arabic language text categorization system. J Comput Sci 3(6):430–435
21. Nene SA, Nayar SK, Murase H (1996) Columbia object image library (COIL-20)
22. Pedregosa F et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
23. Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238
24. Satorra A, Bentler PM (2010) Ensuring positiveness of the scaled difference chi-square test statistic. Psychometrika 75(2):243–248
25. Schlimmer J (1987) Mushroom data set. UCI machine learning repository. School of Information and Computer Science, University of California, Irvine, CA
26. Schlimmer J (1987) Congressional voting records data set. UCI machine learning repository. School of Information and Computer Science, University of California, Irvine, CA
27. Sun L, Zhang XY, Qian YH, Xu JC, Zhang SG, Tian Y (2019) Joint neighborhood entropy-based gene selection method with Fisher score for tumor classification. Appl Intell 49(4):1245–1259
28. Tang B, Kay S, He H (2016) Toward optimal feature selection in naive Bayes for text categorization. IEEE Trans Knowl Data Eng 28(9):2508–2521
29. Thabtah F, Kamalov F (2017) Phishing detection: a case analysis on classifiers with rules using machine learning. J Inf Knowl Manag 16(4):1–16
30. Thabtah F, Kamalov F, Rajab K (2018) A new computational intelligence approach to detect autistic features for autism screening. Int J Med Inform 117:112–124
31. Uysal AK, Gunal S (2014) The impact of preprocessing on text classification. Inf Process Manag 50(1):104–112
32. Urbanowicz RJ, Meeker M, La Cava W, Olson RS, Moore JH (2018) Relief-based feature selection: introduction and review. J Biomed Inform 85:189–203
33. Urbanowicz RJ, Olson RS, Schmitt P, Meeker M, Moore JH (2018) Benchmarking relief-based feature selection methods for bioinformatics data mining. J Biomed Inform 85:168–188
34. Voiculescu D (1993) The analogues of entropy and of Fisher's information measure in free probability theory, I. Commun Math Phys 155(1):71–92
35. Wang Y, Ni S, Priestley J (2019) Improving risk modeling via feature selection, hyper-parameter adjusting, and model ensembling. Glob J Econ Finance 3(1):30–47
36. Wolberg WH, Street N, Mangasarian OL (1995) Breast Cancer Wisconsin (diagnostic) data set. UCI machine learning repository. School of Information and Computer Science, University of California, Irvine, CA
37. Williams PL (2011) Information dynamics: its theory and application to embodied cognitive systems. Ph.D. thesis, Indiana University
38. Yu L, Liu H (2004) Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res 5:1205–1224
Metadata
Title
Monotonicity of the \(\chi ^2\)-statistic and Feature Selection
Authors
Firuz Kamalov
Ho Hon Leung
Sherif Moussa
Publication date
25.03.2020
Publisher
Springer Berlin Heidelberg
Published in
Annals of Data Science / Issue 6/2022
Print ISSN: 2198-5804
Electronic ISSN: 2198-5812
DOI
https://doi.org/10.1007/s40745-020-00251-7
