Published in: Arabian Journal for Science and Engineering 11/2019

21-05-2019 | Research Article - Computer Engineering and Computer Science

On Term Frequency Factor in Supervised Term Weighting Schemes for Text Classification

Authors: Turgut Dogan, Alper Kursat Uysal

Abstract

The performance of text classification can be affected by the choice of an appropriate term weighting scheme as well as by other parameters. Supervised term weighting schemes have become popular in recent years, as they may provide a discriminative vector space representation for text documents belonging to different classes. A term weighting scheme generally consists of three factors: a term frequency factor, a collection frequency factor, and a length normalization factor. Researchers have mostly focused on developing new collection frequency factors in term weighting studies. However, the term frequency factor also plays an important role, especially in supervised term weighting. In this study, we extensively analyzed the effects of using different term frequency factors on seven supervised term weighting schemes. Six of these schemes were applied in previous studies in the literature; we derived the seventh from an existing feature selection method that had not been used as a weighting method before. The analysis was performed using SVM and Rocchio classifiers on two widely known benchmark datasets with different characteristics. Experimental results showed that modifying the term frequency factor in supervised term weighting schemes increased the performance of almost all weighting schemes. In addition, term weighting schemes using the square root function-based term frequency factor (SQRT_TF) were more successful than those using the term frequency (TF) and logarithmic function-based term frequency (LOG_TF) factors. According to the experimental results and statistical analysis, TF appears to be the least effective of the three term frequency factors.
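
The three-factor structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the exact formulas, so the logarithmic form `log(1 + tf)` is an assumption, as is the use of cosine (L2) normalization as the length normalization factor. The function names `tf_factor` and `weight_document`, the example counts, and the collection frequency values are all hypothetical; in a supervised scheme the collection frequency factor would be computed from class label statistics.

```python
import math


def tf_factor(raw_tf: int, kind: str = "TF") -> float:
    """Return the term frequency factor for a raw in-document count."""
    if raw_tf < 0:
        raise ValueError("raw_tf must be non-negative")
    if kind == "TF":
        return float(raw_tf)
    if kind == "LOG_TF":
        # Assumed form: log(1 + tf). The abstract does not state the exact formula.
        return math.log(1.0 + raw_tf)
    if kind == "SQRT_TF":
        return math.sqrt(raw_tf)
    raise ValueError(f"unknown term frequency factor: {kind}")


def weight_document(counts, collection_factor, kind="TF"):
    """Multiply the tf factor by a collection frequency factor, then L2-normalize.

    `collection_factor` maps each term to a precomputed value; terms missing
    from the mapping receive a weight of zero.
    """
    raw = {
        term: tf_factor(count, kind) * collection_factor.get(term, 0.0)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return raw  # empty or all-zero document: nothing to normalize
    return {term: w / norm for term, w in raw.items()}


# Hypothetical document: one frequent term and one rare term,
# with equal (made-up) collection frequency factors.
counts = {"price": 9, "oil": 1}
collection_factor = {"price": 1.0, "oil": 1.0}

for kind in ("TF", "LOG_TF", "SQRT_TF"):
    print(kind, weight_document(counts, collection_factor, kind))
```

In this sketch, LOG_TF and SQRT_TF both dampen the influence of a high raw count relative to TF, so the frequent term dominates the normalized vector less strongly. Whether that helps classification is an empirical question, which is what the study examines.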

Metadata
Title
On Term Frequency Factor in Supervised Term Weighting Schemes for Text Classification
Authors
Turgut Dogan
Alper Kursat Uysal
Publication date
21-05-2019
Publisher
Springer Berlin Heidelberg
Published in
Arabian Journal for Science and Engineering / Issue 11/2019
Print ISSN: 2193-567X
Electronic ISSN: 2191-4281
DOI
https://doi.org/10.1007/s13369-019-03920-9
