Published in: International Journal of Machine Learning and Cybernetics 5/2016

01-10-2016 | Original Article

A supervised term selection technique for effective text categorization

Authors: Tanmay Basu, C. A. Murthy

Abstract

Term selection methods in text categorization reduce the size of the vocabulary to improve the quality of the classifier. A corpus generally contains many irrelevant and noisy terms, which reduce the effectiveness of text categorization. Term selection therefore focuses on identifying the relevant terms for each category without affecting the quality of text categorization. A new supervised term selection technique is proposed for dimensionality reduction. The method assigns a score to each term of a corpus based on its similarity with all the categories, and all the terms of the corpus are then ranked accordingly. Subsequently, the significant terms of each category are selected to create the final subset of terms, irrespective of the size of the category. The performance of the proposed technique is compared with that of nine other term selection methods for categorization of several well-known text corpora using kNN and SVM classifiers. The empirical results show that the proposed method performs significantly better than the other methods in most cases across all the corpora.
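
The score, rank and select pipeline described in the abstract can be sketched in a few lines of Python. The abstract does not give the scoring formula, so the score used here (the share of a category's documents containing the term minus the share of all other documents containing it) is only a hypothetical stand-in, and the function name `select_terms` and its parameters are illustrative assumptions rather than the authors' method. What the sketch does show is the last step: the same number of top-ranked terms is taken from every category, regardless of how many documents the category has.

```python
from collections import defaultdict


def select_terms(docs, labels, k):
    """Return the union of the k best-scoring terms of every category.

    docs   : list of token lists, one per document
    labels : list of category names, one per document
    k      : number of terms kept per category (same for every category)

    NOTE: the score below is a stand-in chosen for illustration only;
    it is NOT the similarity-based score proposed in the paper.
    """
    n_docs = len(docs)
    cat_size = defaultdict(int)                      # category -> number of documents
    df_cat = defaultdict(lambda: defaultdict(int))   # category -> term -> document frequency
    df_all = defaultdict(int)                        # term -> document frequency in the corpus

    for tokens, cat in zip(docs, labels):
        cat_size[cat] += 1
        for term in set(tokens):                     # count each term once per document
            df_cat[cat][term] += 1
            df_all[term] += 1

    selected = set()
    for cat, size in cat_size.items():
        rest = n_docs - size                         # documents outside this category

        def score(term):
            inside = df_cat[cat][term] / size
            outside = (df_all[term] - df_cat[cat][term]) / rest if rest else 0.0
            return inside - outside

        # Rank the terms seen in this category; ties are broken alphabetically.
        ranked = sorted(df_cat[cat], key=lambda t: (-score(t), t))
        selected.update(ranked[:k])                  # same k for small and large categories
    return selected


docs = [
    ["ball", "goal", "team", "the"],
    ["team", "match", "goal", "the"],
    ["ball", "team", "the"],
    ["vote", "party", "the"],
]
labels = ["sport", "sport", "sport", "politics"]
print(sorted(select_terms(docs, labels, k=2)))
```

In this toy corpus the single "politics" document contributes as many terms as the three "sport" documents, and the term "the", which occurs in every document, scores zero in both categories and is dropped.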

Footnotes
4
The test statistic is of the form \(t=\frac{\overline{x}_1-\overline{x}_2}{\sqrt{s^2_1/n_1+s^2_2/n_2}}\), where \(\overline{x}_1, \overline{x}_2\) are the means, \(s_1, s_2\) are the standard deviations and \(n_1, n_2\) are the numbers of observations of the two samples being compared.
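
As a minimal sketch, the statistic in this footnote can be computed with the Python standard library. The function name `t_statistic` and the sample values are illustrative assumptions. The footnote does not say whether the standard deviations use the n - 1 denominator; the sketch assumes they do (`statistics.variance`).

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(sample_1, sample_2):
    """Two-sample t statistic with separate variance estimates.

    Implements t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2).
    Assumption: s1^2 and s2^2 are sample variances (n - 1 denominator).
    """
    n1, n2 = len(sample_1), len(sample_2)
    s1_sq = variance(sample_1)
    s2_sq = variance(sample_2)
    return (mean(sample_1) - mean(sample_2)) / sqrt(s1_sq / n1 + s2_sq / n2)


# Hypothetical scores of two methods over three runs each.
print(t_statistic([2, 4, 6], [1, 3, 5]))
```

For these samples the means are 4 and 3 and both variances are 4, so the statistic is 1 / sqrt(8/3), roughly 0.612. Because each sample keeps its own variance term, this is the unequal-variance form of the two-sample t statistic.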
 
Metadata
Title
A supervised term selection technique for effective text categorization
Authors
Tanmay Basu
C. A. Murthy
Publication date
01-10-2016
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 5/2016
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-015-0421-y
