
2014 | Original Paper | Book chapter

A Survey on Filter Techniques for Feature Selection in Text Mining

Written by: Kusum Kumari Bharti, Pramod Kumar Singh

Published in: Proceedings of the Second International Conference on Soft Computing for Problem Solving (SocProS 2012), December 28-30, 2012

Publisher: Springer India

Abstract

A large portion of a document usually consists of irrelevant features. Rather than helping to identify the actual context of the document, such features increase the dimensionality of the representation model and the computational complexity of the underlying algorithm, and hence adversely affect performance. This makes it necessary to select the relevant features from the given feature space. Feature selection plays a key role here by removing irrelevant features from the original feature space. Feature selection methods are broadly categorized into three groups: filter, wrapper, and embedded. Filter methods are widely used in text mining because of their simplicity, low computational complexity, and efficiency. In this article, we provide a brief survey of filter feature selection methods along with some recent developments in this area.
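To make the idea of a filter method concrete, the following is a minimal sketch rather than a method taken from the chapter. It assumes information gain as the scoring criterion, documents represented as sets of tokens, and a tiny made-up corpus; the function names `entropy`, `information_gain`, and `select_top_k` are hypothetical. The point it illustrates is that a filter ranks terms from the data alone, without training any classifier.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Entropy reduction obtained by splitting documents on presence of `term`."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = ((len(present) / n) * entropy(present)
                   + (len(absent) / n) * entropy(absent))
    return entropy(labels) - conditional


def select_top_k(docs, labels, k):
    """Rank every term by information gain and keep the k best.

    This is a filter: terms are scored independently of any learning algorithm.
    """
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(vocabulary,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy corpus (an assumption for illustration only).
docs = [
    {"the", "goal", "match"},
    {"the", "match", "team"},
    {"the", "vote", "election"},
    {"the", "election", "party"},
]
labels = ["sport", "sport", "politics", "politics"]

selected = select_top_k(docs, labels, 2)
print(selected)
```

In this toy corpus the term "the" occurs in every document and therefore scores zero, which is the kind of irrelevant feature the abstract describes. A wrapper method would instead evaluate candidate feature subsets by training and testing a classifier on each, which is typically far more expensive.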

Metadata
Title
A Survey on Filter Techniques for Feature Selection in Text Mining
Written by
Kusum Kumari Bharti
Pramod Kumar Singh
Copyright year
2014
Publisher
Springer India
DOI
https://doi.org/10.1007/978-81-322-1602-5_154
