Skip to main content
Top
Published in: Neural Computing and Applications 1/2014

01-01-2014 | Invited Review

A review of feature selection methods based on mutual information

Authors: Jorge R. Vergara, Pablo A. Estévez

Published in: Neural Computing and Applications | Issue 1/2014

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

In this work, we present a review of the state of the art of information-theoretic feature selection methods. The concepts of feature relevance, redundance, and complementarity (synergy) are clearly defined, as well as Markov blanket. The problem of optimal feature selection is defined. A unifying theoretical framework is described, which can retrofit successful heuristic criteria, indicating the approximations made by each method. A number of open problems in the field are presented.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Almuallim H, Dietterich TG (1991) Learning with many irrelevant features. In: Artificial intelligence, proceedings of the ninth national conference on, AAAI Press, pp 547–552 Almuallim H, Dietterich TG (1991) Learning with many irrelevant features. In: Artificial intelligence, proceedings of the ninth national conference on, AAAI Press, pp 547–552
2.
go back to reference Almuallim H, Dietterich TG (1992) Efficient algorithms for identifying relevant features. In: Artificial intelligence, proceedings of the ninth canadian conference on, Morgan Kaufmann, pp 38–45 Almuallim H, Dietterich TG (1992) Efficient algorithms for identifying relevant features. In: Artificial intelligence, proceedings of the ninth canadian conference on, Morgan Kaufmann, pp 38–45
3.
go back to reference Balagani K, Phoha V (2010) On the feature selection criterion based on an approximation of multidimensional mutual information. IEEE Trans Pattern Anal Mach Intell 32(7):1342–1343CrossRef Balagani K, Phoha V (2010) On the feature selection criterion based on an approximation of multidimensional mutual information. IEEE Trans Pattern Anal Mach Intell 32(7):1342–1343CrossRef
4.
go back to reference Battiti R (1994) Using mutual information for selecting features in supervised neural net learning. IEEE Trans Neural Netw 5(4):537–550CrossRef Battiti R (1994) Using mutual information for selecting features in supervised neural net learning. IEEE Trans Neural Netw 5(4):537–550CrossRef
5.
go back to reference Bell AJ (2003) The co-information lattice. Analysis pp 921–926 Bell AJ (2003) The co-information lattice. Analysis pp 921–926
6.
go back to reference Bell DA, Wang H (2000) A formalism for relevance and its application in feature subset selection. Mach Learn 41(2):175–195CrossRefMATH Bell DA, Wang H (2000) A formalism for relevance and its application in feature subset selection. Mach Learn 41(2):175–195CrossRefMATH
7.
go back to reference Bellman RE (1961) Adaptive control processes: a guided tour. 1st edn. Princeton University Press, PrincetonMATH Bellman RE (1961) Adaptive control processes: a guided tour. 1st edn. Princeton University Press, PrincetonMATH
8.
go back to reference Bins J, Draper B (2001) Feature selection from huge feature sets. In: Computer Vision, 2001. Proceedings eighth IEEE international conference, vol 2, pp 159–165 Bins J, Draper B (2001) Feature selection from huge feature sets. In: Computer Vision, 2001. Proceedings eighth IEEE international conference, vol 2, pp 159–165
9.
go back to reference Bonev B, Escolano F, Cazorla M (2008) Feature selection, mutual information, and the classification of high-dimensional patterns: applications to image classification and microarray data analysis. Pattern Anal Appl 11(3-4):309–319CrossRefMathSciNet Bonev B, Escolano F, Cazorla M (2008) Feature selection, mutual information, and the classification of high-dimensional patterns: applications to image classification and microarray data analysis. Pattern Anal Appl 11(3-4):309–319CrossRefMathSciNet
10.
go back to reference Brown G, Pocock A, Zhao MJ, Luján M (2012) Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J Mach Learn Res 13:27–66MathSciNet Brown G, Pocock A, Zhao MJ, Luján M (2012) Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J Mach Learn Res 13:27–66MathSciNet
11.
go back to reference Brown LE, Tsamardinos I (2008) Markov blanket-based variable selection in feature space. Technical report dsl-08-01, Discovery systems laboratory, Vanderbilt University Brown LE, Tsamardinos I (2008) Markov blanket-based variable selection in feature space. Technical report dsl-08-01, Discovery systems laboratory, Vanderbilt University
12.
go back to reference Cheng H, Qin Z, Feng C, Wang Y, Li F (2011) Conditional mutual information-based feature selection analyzing for synergy and redundancy. Electron Telecommun Res Inst 33(2):210–218 Cheng H, Qin Z, Feng C, Wang Y, Li F (2011) Conditional mutual information-based feature selection analyzing for synergy and redundancy. Electron Telecommun Res Inst 33(2):210–218
13.
go back to reference Chow T, Huang D (2005) Estimating optimal feature subsets using efficient estimation of high-dimensional mutual information. IEEE Trans Neural Netw 16(1):213–224CrossRefMathSciNet Chow T, Huang D (2005) Estimating optimal feature subsets using efficient estimation of high-dimensional mutual information. IEEE Trans Neural Netw 16(1):213–224CrossRefMathSciNet
14.
go back to reference Cover TM, Thomas JA (2006) Elements of Information Theory. 2nd edn. Wiley-Interscience, New JerseyMATH Cover TM, Thomas JA (2006) Elements of Information Theory. 2nd edn. Wiley-Interscience, New JerseyMATH
15.
go back to reference Davies S, Russell S (1994) Np-completeness of searches for smallest possible feature sets. In: Intelligent Relevance, association for the advancement of artificial intelligence symposium on, AAAI Press, pp 37–39 Davies S, Russell S (1994) Np-completeness of searches for smallest possible feature sets. In: Intelligent Relevance, association for the advancement of artificial intelligence symposium on, AAAI Press, pp 37–39
16.
go back to reference Duch W (2006) Filter methods. In: Feature extraction, foundations and applications, studies in fuzziness and soft computing, vol 207, Springer, Heidelberg, chap 3, pp 167–185 Duch W (2006) Filter methods. In: Feature extraction, foundations and applications, studies in fuzziness and soft computing, vol 207, Springer, Heidelberg, chap 3, pp 167–185
17.
go back to reference Duch W, Winiarski T, Biesiada J, Kachel A (2003) Feature selection and ranking filter. In: International conference on artificial neural networks (ICANN) and International conference on neural information processing (ICONIP), pp 251–254 Duch W, Winiarski T, Biesiada J, Kachel A (2003) Feature selection and ranking filter. In: International conference on artificial neural networks (ICANN) and International conference on neural information processing (ICONIP), pp 251–254
18.
go back to reference Estévez PA, Caballero R (1998) A niching genetic algorithm for selecting features for neural networks classifiers. In: Perspectives in neural computation, Springer, New York, pp 311–316 Estévez PA, Caballero R (1998) A niching genetic algorithm for selecting features for neural networks classifiers. In: Perspectives in neural computation, Springer, New York, pp 311–316
19.
go back to reference Estévez PA, Tesmer M, Pérez CA, Zurada JM (2009) Normalized mutual information feature selection. IEEE Trans Neural Netw 20(2):189–201CrossRef Estévez PA, Tesmer M, Pérez CA, Zurada JM (2009) Normalized mutual information feature selection. IEEE Trans Neural Netw 20(2):189–201CrossRef
21.
go back to reference Fleuret F, Guyon I (2004) Fast binary feature selection with conditional mutual information. J Mach Learn Res 5:1531–1555MATH Fleuret F, Guyon I (2004) Fast binary feature selection with conditional mutual information. J Mach Learn Res 5:1531–1555MATH
22.
go back to reference Guo B, Nixon MS (2009) Gait feature subset selection by mutual information. Syst Man Cybern Part A IEEE Trans 39(1):36–46CrossRef Guo B, Nixon MS (2009) Gait feature subset selection by mutual information. Syst Man Cybern Part A IEEE Trans 39(1):36–46CrossRef
23.
go back to reference Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182MATH Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182MATH
24.
go back to reference Guyon I, Elisseeff A (2006) An introduction to feature extraction. In: Feature extraction, foundations and applications, studies in fuzziness and soft computing, vol 207, Springer, Berlin, pp 1–25 Guyon I, Elisseeff A (2006) An introduction to feature extraction. In: Feature extraction, foundations and applications, studies in fuzziness and soft computing, vol 207, Springer, Berlin, pp 1–25
25.
go back to reference Guyon I, Aliferis C, Elisseeff A (2008) Causal feature selection. In: Liu H, Motoda H (eds) Computational methods of feature selection, Chapman & Hall/CRC, chap 4 Guyon I, Aliferis C, Elisseeff A (2008) Causal feature selection. In: Liu H, Motoda H (eds) Computational methods of feature selection, Chapman & Hall/CRC, chap 4
26.
27.
go back to reference Hero A, Michel O (1999) Estimation of renyi information divergence via pruned minimal spanning trees. Higher-Order statistics proceedings of the IEEE signal processing workshop, pp 264–268 Hero A, Michel O (1999) Estimation of renyi information divergence via pruned minimal spanning trees. Higher-Order statistics proceedings of the IEEE signal processing workshop, pp 264–268
28.
go back to reference Huang S (2003) Dimensionality reduction in automatic knowledge acquisition: a simple greedy search approach. IEEE Trans Knowl Data Eng 15(6):1364–1373CrossRef Huang S (2003) Dimensionality reduction in automatic knowledge acquisition: a simple greedy search approach. IEEE Trans Knowl Data Eng 15(6):1364–1373CrossRef
29.
go back to reference Jakulin A (2005) Learning based on attribute interactions. PhD thesis, University of Ljubljana, Slovenia Jakulin A (2005) Learning based on attribute interactions. PhD thesis, University of Ljubljana, Slovenia
31.
go back to reference Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1–2):273–324CrossRefMATH Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1–2):273–324CrossRefMATH
32.
go back to reference Kojadinovic I (2005) Relevance measures for subset variable selection in regression problems based on k-additive mutual information. Comput Stat Data Anal 49(4):1205–1227CrossRefMATHMathSciNet Kojadinovic I (2005) Relevance measures for subset variable selection in regression problems based on k-additive mutual information. Comput Stat Data Anal 49(4):1205–1227CrossRefMATHMathSciNet
33.
go back to reference Koller D, Sahami M (1996) Toward optimal feature selection. Technical Report 1996-77, Stanford InfoLab Koller D, Sahami M (1996) Toward optimal feature selection. Technical Report 1996-77, Stanford InfoLab
34.
go back to reference Kraskov A, Stögbauer H, Grassberger P (2004) Estimating mutual information. Phys Rev E 69:066,138CrossRef Kraskov A, Stögbauer H, Grassberger P (2004) Estimating mutual information. Phys Rev E 69:066,138CrossRef
35.
go back to reference Kullback S (1997) Information theory and statistics. 2nd edn. Dover, New YorkMATH Kullback S (1997) Information theory and statistics. 2nd edn. Dover, New YorkMATH
37.
go back to reference Kwak N, Choi CH (2002) Input feature selection for classification problems. IEEE Trans Neural Netw 13(1):143–159CrossRef Kwak N, Choi CH (2002) Input feature selection for classification problems. IEEE Trans Neural Netw 13(1):143–159CrossRef
38.
go back to reference Lal KN, Chapelle O, Weston J, Elisseeff A (2006) Embedded methods. In: Feature extraction, foundations and applications, studies in fuzziness and soft computing, vol 207, Springer, Berlin, chap 5, pp 167–185 Lal KN, Chapelle O, Weston J, Elisseeff A (2006) Embedded methods. In: Feature extraction, foundations and applications, studies in fuzziness and soft computing, vol 207, Springer, Berlin, chap 5, pp 167–185
39.
go back to reference Lewis DD (1992) Feature selection and feature extraction for text categorization. In: Proceedings of speech and natural language workshop, Morgan Kaufmann, pp 212–217 Lewis DD (1992) Feature selection and feature extraction for text categorization. In: Proceedings of speech and natural language workshop, Morgan Kaufmann, pp 212–217
40.
go back to reference Lin D, Tang X (2006) Conditional infomax learning: an integrated framework for feature extraction and fusion. In: Computer vision—ECCV 2006, Lecture Notes in Computer Science, vol 3951, Springer, Berlin, pp 68–82 Lin D, Tang X (2006) Conditional infomax learning: an integrated framework for feature extraction and fusion. In: Computer vision—ECCV 2006, Lecture Notes in Computer Science, vol 3951, Springer, Berlin, pp 68–82
41.
go back to reference Marill T, Green D (1963) On the effectiveness of receptors in recognition systems. IEEE Trans Inform Theory 9(1):11–17CrossRef Marill T, Green D (1963) On the effectiveness of receptors in recognition systems. IEEE Trans Inform Theory 9(1):11–17CrossRef
42.
43.
go back to reference Meyer P, Schretter C, Bontempi G (2008) Information-theoretic feature selection in microarray data using variable complementarity. IEEE J Sel Top Signal Process 2(3):261–274CrossRef Meyer P, Schretter C, Bontempi G (2008) Information-theoretic feature selection in microarray data using variable complementarity. IEEE J Sel Top Signal Process 2(3):261–274CrossRef
44.
go back to reference Mo D, Huang SH (2011) Feature selection based on inference correlation. Intell Data Anal 15(3):375–398 Mo D, Huang SH (2011) Feature selection based on inference correlation. Intell Data Anal 15(3):375–398
45.
go back to reference Mo D, Huang SH (2012) Fractal-based intrinsic dimension estimation and its application in dimensionality reduction. IEEE Trans Knowl Data Eng 24(1):59–71CrossRef Mo D, Huang SH (2012) Fractal-based intrinsic dimension estimation and its application in dimensionality reduction. IEEE Trans Knowl Data Eng 24(1):59–71CrossRef
46.
go back to reference Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238CrossRef Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238CrossRef
47.
go back to reference Principe JC (2010) Information theoretic learning: Renyi’s entropy and kernel perspectives. 1st edn. Springer Publishing Company, BerlinCrossRef Principe JC (2010) Information theoretic learning: Renyi’s entropy and kernel perspectives. 1st edn. Springer Publishing Company, BerlinCrossRef
48.
go back to reference Raudys S, Jain A (1991) Small sample size effects in statistical pattern recognition: recommendations for practitioners. IEEE Trans Pattern Anal Mach Intell 13(3):252–264CrossRef Raudys S, Jain A (1991) Small sample size effects in statistical pattern recognition: recommendations for practitioners. IEEE Trans Pattern Anal Mach Intell 13(3):252–264CrossRef
49.
go back to reference Saeys Y, Inza I, Larranaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507–2517CrossRef Saeys Y, Inza I, Larranaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507–2517CrossRef
50.
go back to reference Sebban M, Nock R (2002) A hybrid filter/wrapper approach of feature selection using information theory. Pattern Recogn 35(4):835–846CrossRefMATH Sebban M, Nock R (2002) A hybrid filter/wrapper approach of feature selection using information theory. Pattern Recogn 35(4):835–846CrossRefMATH
51.
go back to reference Seth S, Príncipe J (2010) Variable selection: A statistical dependence perspective. In: Machine Learning and Applications (ICMLA), 2010 ninth international conference, pp 931–936 Seth S, Príncipe J (2010) Variable selection: A statistical dependence perspective. In: Machine Learning and Applications (ICMLA), 2010 ninth international conference, pp 931–936
52.
go back to reference Shannon CE (1948) A mathematical theory of communication. Bell System Technical J 27:379–423, 625–56 Shannon CE (1948) A mathematical theory of communication. Bell System Technical J 27:379–423, 625–56
53.
go back to reference Somol P, Pudil P, Kittler J (2004) Fast branch & bound algorithms for optimal feature selection. IEEE Trans Pattern Anal Mach Intell 26(7):900–912CrossRef Somol P, Pudil P, Kittler J (2004) Fast branch & bound algorithms for optimal feature selection. IEEE Trans Pattern Anal Mach Intell 26(7):900–912CrossRef
54.
go back to reference Stearns SD (1976) On selecting features for pattern classifiers. In: Pattern recognition, proceedings of the 3rd international conference on, Coronado, CA, pp 71–75 Stearns SD (1976) On selecting features for pattern classifiers. In: Pattern recognition, proceedings of the 3rd international conference on, Coronado, CA, pp 71–75
55.
go back to reference Tishby N, Pereira FC, Bialek W (1999) The information bottleneck method. Proceedings on 37th Annu Allerton conference communication, Control and Computing Tishby N, Pereira FC, Bialek W (1999) The information bottleneck method. Proceedings on 37th Annu Allerton conference communication, Control and Computing
56.
go back to reference Torkkola K (2006) Information-theoretic methods. In: Feature extraction, foundations and applications, studies in fuzziness and soft computing, vol 207, Springer, Berlin, chap 6, pp 167–185 Torkkola K (2006) Information-theoretic methods. In: Feature extraction, foundations and applications, studies in fuzziness and soft computing, vol 207, Springer, Berlin, chap 6, pp 167–185
57.
go back to reference Trunk GV (1979) A problem of dimensionality: A simple example. IEEE Trans Pattern Anal Mach Intell-1(3):306–307 Trunk GV (1979) A problem of dimensionality: A simple example. IEEE Trans Pattern Anal Mach Intell-1(3):306–307
58.
go back to reference Tsamardinos I, Aliferis CF (2003) Towards principled feature selection: relevancy, filters and wrappers. In: Artificial intelligence and statistics, proceedings of the ninth international workshop, Morgan Kaufmann Publishers Tsamardinos I, Aliferis CF (2003) Towards principled feature selection: relevancy, filters and wrappers. In: Artificial intelligence and statistics, proceedings of the ninth international workshop, Morgan Kaufmann Publishers
59.
go back to reference Tsamardinos I, Aliferis CF, Statnikov E (2003) Algorithms for large scale markov blanket discovery. In: The 16th international FLAIRS conference, St, AAAI Press, pp 376–380 Tsamardinos I, Aliferis CF, Statnikov E (2003) Algorithms for large scale markov blanket discovery. In: The 16th international FLAIRS conference, St, AAAI Press, pp 376–380
60.
go back to reference Vergara JR, Estévez PA (2010) Cmim-2: an enhanced conditional mutual information maximization criterion for feature selection. J Appl Comput Sci Methods 2(1):5–20 Vergara JR, Estévez PA (2010) Cmim-2: an enhanced conditional mutual information maximization criterion for feature selection. J Appl Comput Sci Methods 2(1):5–20
61.
go back to reference Vidal-Naquet M, Ullman S (2003) Object recognition with informative features and linear classification. In: Computer Vision, 2003. Proceedings ninth IEEE international conference, vol 1, pp 281–288 Vidal-Naquet M, Ullman S (2003) Object recognition with informative features and linear classification. In: Computer Vision, 2003. Proceedings ninth IEEE international conference, vol 1, pp 281–288
62.
go back to reference Watanabe S (1960) Information theoretical analysis of multivariate correlation. IBM J Res Dev 4(1):66–82CrossRefMATH Watanabe S (1960) Information theoretical analysis of multivariate correlation. IBM J Res Dev 4(1):66–82CrossRefMATH
64.
go back to reference Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, Vapnik V (2000) Feature selection for svms. In: Advances in neural information processing systems 13, MIT Press, pp 668–674 Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, Vapnik V (2000) Feature selection for svms. In: Advances in neural information processing systems 13, MIT Press, pp 668–674
65.
66.
go back to reference Yang HH, Moody J (1999) Feature selection based on joint mutual information. In: Advances in intelligent data analysis, proceedings of international ICSC symposium, pp 22–25 Yang HH, Moody J (1999) Feature selection based on joint mutual information. In: Advances in intelligent data analysis, proceedings of international ICSC symposium, pp 22–25
67.
go back to reference Yu L, Liu H (2004) Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res 5:1205–1224MATH Yu L, Liu H (2004) Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res 5:1205–1224MATH
68.
go back to reference Zhao Z, Liu H (2009) Searching for interacting features in subset selection. Intell Data Anal 13(2):207–228 Zhao Z, Liu H (2009) Searching for interacting features in subset selection. Intell Data Anal 13(2):207–228
Metadata
Title
A review of feature selection methods based on mutual information
Authors
Jorge R. Vergara
Pablo A. Estévez
Publication date
01-01-2014
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 1/2014
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-013-1368-0

Other articles of this Issue 1/2014

Neural Computing and Applications 1/2014 Go to the issue

Premium Partner