Top

Neural Computing and Applications

Published in:

01-01-2014 | Invited Review

A review of feature selection methods based on mutual information

Authors: Jorge R. Vergara, Pablo A. Estévez

Published in: Neural Computing and Applications | Issue 1/2014

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

In this work, we present a review of the state of the art of information-theoretic feature selection methods. The concepts of feature relevance, redundance, and complementarity (synergy) are clearly defined, as well as Markov blanket. The problem of optimal feature selection is defined. A unifying theoretical framework is described, which can retrofit successful heuristic criteria, indicating the approximations made by each method. A number of open problems in the field are presented.

previous article Cuckoo search: recent advances and applications

next article Solution of nonlinear fractional differential equations using an efficient approach

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Almuallim H, Dietterich TG (1991) Learning with many irrelevant features. In: Artificial intelligence, proceedings of the ninth national conference on, AAAI Press, pp 547–552

Almuallim H, Dietterich TG (1992) Efficient algorithms for identifying relevant features. In: Artificial intelligence, proceedings of the ninth canadian conference on, Morgan Kaufmann, pp 38–45

Balagani K, Phoha V (2010) On the feature selection criterion based on an approximation of multidimensional mutual information. IEEE Trans Pattern Anal Mach Intell 32(7):1342–1343CrossRef

Battiti R (1994) Using mutual information for selecting features in supervised neural net learning. IEEE Trans Neural Netw 5(4):537–550CrossRef

Bell AJ (2003) The co-information lattice. Analysis pp 921–926

Bell DA, Wang H (2000) A formalism for relevance and its application in feature subset selection. Mach Learn 41(2):175–195CrossRefMATH

Bellman RE (1961) Adaptive control processes: a guided tour. 1st edn. Princeton University Press, PrincetonMATH

Bins J, Draper B (2001) Feature selection from huge feature sets. In: Computer Vision, 2001. Proceedings eighth IEEE international conference, vol 2, pp 159–165

Bonev B, Escolano F, Cazorla M (2008) Feature selection, mutual information, and the classification of high-dimensional patterns: applications to image classification and microarray data analysis. Pattern Anal Appl 11(3-4):309–319CrossRefMathSciNet

10.

Brown G, Pocock A, Zhao MJ, Luján M (2012) Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J Mach Learn Res 13:27–66MathSciNet

11.

Brown LE, Tsamardinos I (2008) Markov blanket-based variable selection in feature space. Technical report dsl-08-01, Discovery systems laboratory, Vanderbilt University

12.

Cheng H, Qin Z, Feng C, Wang Y, Li F (2011) Conditional mutual information-based feature selection analyzing for synergy and redundancy. Electron Telecommun Res Inst 33(2):210–218

13.

Chow T, Huang D (2005) Estimating optimal feature subsets using efficient estimation of high-dimensional mutual information. IEEE Trans Neural Netw 16(1):213–224CrossRefMathSciNet

14.

Cover TM, Thomas JA (2006) Elements of Information Theory. 2nd edn. Wiley-Interscience, New JerseyMATH

15.

Davies S, Russell S (1994) Np-completeness of searches for smallest possible feature sets. In: Intelligent Relevance, association for the advancement of artificial intelligence symposium on, AAAI Press, pp 37–39

16.

Duch W (2006) Filter methods. In: Feature extraction, foundations and applications, studies in fuzziness and soft computing, vol 207, Springer, Heidelberg, chap 3, pp 167–185

17.

Duch W, Winiarski T, Biesiada J, Kachel A (2003) Feature selection and ranking filter. In: International conference on artificial neural networks (ICANN) and International conference on neural information processing (ICONIP), pp 251–254

18.

Estévez PA, Caballero R (1998) A niching genetic algorithm for selecting features for neural networks classifiers. In: Perspectives in neural computation, Springer, New York, pp 311–316

19.

Estévez PA, Tesmer M, Pérez CA, Zurada JM (2009) Normalized mutual information feature selection. IEEE Trans Neural Netw 20(2):189–201CrossRef

20.

Feder M, Merhav N (1994) Relations between entropy and error probability. IEEE Trans Inform Theory 40(1):259–266CrossRefMATHMathSciNet

21.

Fleuret F, Guyon I (2004) Fast binary feature selection with conditional mutual information. J Mach Learn Res 5:1531–1555MATH

22.

Guo B, Nixon MS (2009) Gait feature subset selection by mutual information. Syst Man Cybern Part A IEEE Trans 39(1):36–46CrossRef

23.

Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182MATH

24.

Guyon I, Elisseeff A (2006) An introduction to feature extraction. In: Feature extraction, foundations and applications, studies in fuzziness and soft computing, vol 207, Springer, Berlin, pp 1–25

25.

Guyon I, Aliferis C, Elisseeff A (2008) Causal feature selection. In: Liu H, Motoda H (eds) Computational methods of feature selection, Chapman & Hall/CRC, chap 4

26.

Hellman M, Raviv J (1970) Probability of error, equivocation, and the chernoff bound. IEEE Trans Inform Theory 16(4):368–372CrossRefMATHMathSciNet

27.

Hero A, Michel O (1999) Estimation of renyi information divergence via pruned minimal spanning trees. Higher-Order statistics proceedings of the IEEE signal processing workshop, pp 264–268

28.

Huang S (2003) Dimensionality reduction in automatic knowledge acquisition: a simple greedy search approach. IEEE Trans Knowl Data Eng 15(6):1364–1373CrossRef

29.

Jakulin A (2005) Learning based on attribute interactions. PhD thesis, University of Ljubljana, Slovenia

30.

Jakulin A, Bratko I (2003) Quantifying and visualizing attribute interactions. CoRR cs.AI/0308002, http://arxiv.org/abs/cs.AI/0308002

31.

Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1–2):273–324CrossRefMATH

32.

Kojadinovic I (2005) Relevance measures for subset variable selection in regression problems based on k-additive mutual information. Comput Stat Data Anal 49(4):1205–1227CrossRefMATHMathSciNet

33.

Koller D, Sahami M (1996) Toward optimal feature selection. Technical Report 1996-77, Stanford InfoLab

34.

Kraskov A, Stögbauer H, Grassberger P (2004) Estimating mutual information. Phys Rev E 69:066,138CrossRef

35.

Kullback S (1997) Information theory and statistics. 2nd edn. Dover, New YorkMATH

36.

Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22:49–86CrossRefMathSciNet

37.

Kwak N, Choi CH (2002) Input feature selection for classification problems. IEEE Trans Neural Netw 13(1):143–159CrossRef

38.

Lal KN, Chapelle O, Weston J, Elisseeff A (2006) Embedded methods. In: Feature extraction, foundations and applications, studies in fuzziness and soft computing, vol 207, Springer, Berlin, chap 5, pp 167–185

39.

Lewis DD (1992) Feature selection and feature extraction for text categorization. In: Proceedings of speech and natural language workshop, Morgan Kaufmann, pp 212–217

40.

Lin D, Tang X (2006) Conditional infomax learning: an integrated framework for feature extraction and fusion. In: Computer vision—ECCV 2006, Lecture Notes in Computer Science, vol 3951, Springer, Berlin, pp 68–82

41.

Marill T, Green D (1963) On the effectiveness of receptors in recognition systems. IEEE Trans Inform Theory 9(1):11–17CrossRef

42.

McGill W (1954) Multivariate information transmission. Psychometrika 19(2):97–116CrossRefMATH

43.

Meyer P, Schretter C, Bontempi G (2008) Information-theoretic feature selection in microarray data using variable complementarity. IEEE J Sel Top Signal Process 2(3):261–274CrossRef

44.

Mo D, Huang SH (2011) Feature selection based on inference correlation. Intell Data Anal 15(3):375–398

45.

Mo D, Huang SH (2012) Fractal-based intrinsic dimension estimation and its application in dimensionality reduction. IEEE Trans Knowl Data Eng 24(1):59–71CrossRef

46.

Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238CrossRef

47.

Principe JC (2010) Information theoretic learning: Renyi’s entropy and kernel perspectives. 1st edn. Springer Publishing Company, BerlinCrossRef

48.

Raudys S, Jain A (1991) Small sample size effects in statistical pattern recognition: recommendations for practitioners. IEEE Trans Pattern Anal Mach Intell 13(3):252–264CrossRef

49.

Saeys Y, Inza I, Larranaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507–2517CrossRef

50.

Sebban M, Nock R (2002) A hybrid filter/wrapper approach of feature selection using information theory. Pattern Recogn 35(4):835–846CrossRefMATH

51.

Seth S, Príncipe J (2010) Variable selection: A statistical dependence perspective. In: Machine Learning and Applications (ICMLA), 2010 ninth international conference, pp 931–936

52.

Shannon CE (1948) A mathematical theory of communication. Bell System Technical J 27:379–423, 625–56

53.

Somol P, Pudil P, Kittler J (2004) Fast branch & bound algorithms for optimal feature selection. IEEE Trans Pattern Anal Mach Intell 26(7):900–912CrossRef

54.

Stearns SD (1976) On selecting features for pattern classifiers. In: Pattern recognition, proceedings of the 3rd international conference on, Coronado, CA, pp 71–75

55.

Tishby N, Pereira FC, Bialek W (1999) The information bottleneck method. Proceedings on 37th Annu Allerton conference communication, Control and Computing

56.

Torkkola K (2006) Information-theoretic methods. In: Feature extraction, foundations and applications, studies in fuzziness and soft computing, vol 207, Springer, Berlin, chap 6, pp 167–185

57.

Trunk GV (1979) A problem of dimensionality: A simple example. IEEE Trans Pattern Anal Mach Intell-1(3):306–307

58.

Tsamardinos I, Aliferis CF (2003) Towards principled feature selection: relevancy, filters and wrappers. In: Artificial intelligence and statistics, proceedings of the ninth international workshop, Morgan Kaufmann Publishers

59.

Tsamardinos I, Aliferis CF, Statnikov E (2003) Algorithms for large scale markov blanket discovery. In: The 16th international FLAIRS conference, St, AAAI Press, pp 376–380

60.

Vergara JR, Estévez PA (2010) Cmim-2: an enhanced conditional mutual information maximization criterion for feature selection. J Appl Comput Sci Methods 2(1):5–20

61.

Vidal-Naquet M, Ullman S (2003) Object recognition with informative features and linear classification. In: Computer Vision, 2003. Proceedings ninth IEEE international conference, vol 1, pp 281–288

62.

Watanabe S (1960) Information theoretical analysis of multivariate correlation. IBM J Res Dev 4(1):66–82CrossRefMATH

63.

Webb AR (2002) Statistical Pattern Recognition. 2nd edn. Wiley, New JerseyCrossRefMATH

64.

Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, Vapnik V (2000) Feature selection for svms. In: Advances in neural information processing systems 13, MIT Press, pp 668–674

65.

Whitney A (1971) A direct method of nonparametric measurement selection. IEEE Trans Comput C-20(9):1100–1103CrossRefMathSciNet

66.

Yang HH, Moody J (1999) Feature selection based on joint mutual information. In: Advances in intelligent data analysis, proceedings of international ICSC symposium, pp 22–25

67.

Yu L, Liu H (2004) Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res 5:1205–1224MATH

68.

Zhao Z, Liu H (2009) Searching for interacting features in subset selection. Intell Data Anal 13(2):207–228

Title: A review of feature selection methods based on mutual information
Authors: Jorge R. Vergara
Pablo A. Estévez
Publication date: 01-01-2014
Publisher: Springer London
Published in: Neural Computing and Applications / Issue 1/2014
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI: https://doi.org/10.1007/s00521-013-1368-0

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Springer Professional "Wirtschaft+Technik"

Other articles of this Issue 1/2014

Evolutionary dynamics of sales agents’ promotional effort on small-world networks

Does social network always promote entrepreneurial intentions? An empirical study in China

Solution of nonlinear fractional differential equations using an efficient approach

Robots learn to dance through interaction with humans

Cuckoo search: recent advances and applications

A palm vein identification system based on Gabor wavelet features

Premium Partner