Knowledge and Information Systems

07-07-2021 | Regular Paper

On entropy-based term weighting schemes for text categorization

Authors: Tao Wang, Yi Cai, Ho-fung Leung, Raymond Y. K. Lau, Haoran Xie, Qing Li

Published in: Knowledge and Information Systems | Issue 9/2021

Abstract

In text categorization, the Vector Space Model (VSM) has been widely used for representing documents, in which a document is represented by a vector of terms. Since different terms contribute to a document’s semantics to varying degrees, a number of term weighting schemes have been proposed for VSM to improve text categorization performance. Much evidence shows that the performance of a term weighting scheme often varies across text categorization tasks, while the mechanism underlying this variability remains unclear. Moreover, existing schemes often weight a term with respect to a category locally, without considering the global distribution of the term’s occurrences across all categories in a corpus. In this paper, we first systematically examine the pros and cons of existing term weighting schemes in text categorization and explore why some schemes with sound theoretical bases, such as the chi-square test and information gain, perform poorly in empirical evaluations. By measuring how concentrated a term’s distribution is across all categories in a corpus, we then propose a series of entropy-based term weighting schemes to measure the distinguishing power of a term in text categorization. In extensive experiments on five different datasets, the proposed term weighting schemes consistently outperform the state-of-the-art schemes. Moreover, our findings shed new light on how to choose and develop an effective term weighting scheme for a specific text categorization task.
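
The idea of scoring a term by how concentrated its occurrences are across categories can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so the weight below (one minus the normalized Shannon entropy) is an assumption chosen only for illustration, and the function names `category_entropy` and `concentration_weight` are hypothetical.

```python
import math

def category_entropy(counts):
    """Shannon entropy (in bits) of a term's occurrence counts over categories."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def concentration_weight(counts):
    """Illustrative weight in [0, 1]: 1 minus the normalized entropy.

    This is an assumed form, not the scheme proposed in the paper.
    A term that occurs in a single category scores 1; a term spread
    evenly over all categories scores 0.
    """
    k = len(counts)
    if k < 2 or sum(counts) == 0:
        return 0.0  # no evidence that the term separates categories
    return 1.0 - category_entropy(counts) / math.log2(k)

# Occurrence counts of three hypothetical terms over four categories.
print(concentration_weight([10, 0, 0, 0]))  # concentrated in one category
print(concentration_weight([5, 5, 5, 5]))   # spread evenly
print(concentration_weight([8, 1, 1, 0]))   # in between
```

Because the entropy is computed over all categories at once, a weight of this kind reflects the global distribution of a term rather than its relation to a single category.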

http://www.daviddlewis.com/resources/testcollections/reuters21578/.

http://jwebpro.sourceforge.net/data-web-snippets.tar.gz.

http://qwone.com/~jason/20Newsgroups/.

http://web.ist.utl.pt/~acardoso/datasets/.

http://disi.unitn.it/moschitti/corpora.htm.

http://tartarus.org/martin/PorterStemmer/.

http://www.csie.ntu.edu.tw/~cjlin/liblinear/.

A negative correlation leads to a value of \({or}<1\), while a positive one leads to a value of \({or}>1\).
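
A small sketch can make this reading of the odds ratio concrete. It uses the standard definition on a 2×2 term/category contingency table; the argument names are hypothetical, and any smoothing or notation used in the paper itself is not shown on this page.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of a term with respect to a category.

    a: documents in the category that contain the term
    b: documents in the category that do not contain the term
    c: documents outside the category that contain the term
    d: documents outside the category that do not contain the term
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)

print(odds_ratio(30, 10, 10, 30))  # term favours the category: value above 1
print(odds_ratio(10, 30, 30, 10))  # term avoids the category: value below 1
print(odds_ratio(20, 20, 20, 20))  # no association: value equal to 1
```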

Title: On entropy-based term weighting schemes for text categorization
Authors: Tao Wang, Yi Cai, Ho-fung Leung, Raymond Y. K. Lau, Haoran Xie, Qing Li
Publication date: 07-07-2021
Publisher: Springer London
Published in: Knowledge and Information Systems / Issue 9/2021
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI: https://doi.org/10.1007/s10115-021-01581-5
