Published in: Knowledge and Information Systems 9/2021

07.07.2021 | Regular Paper

On entropy-based term weighting schemes for text categorization

Authors: Tao Wang, Yi Cai, Ho-fung Leung, Raymond Y. K. Lau, Haoran Xie, Qing Li


Abstract

In text categorization, the Vector Space Model (VSM) has been widely used to represent documents, with each document encoded as a vector of terms. Since different terms contribute to a document's semantics to varying degrees, a number of term weighting schemes have been proposed for VSM to improve text categorization performance. Much evidence shows that the performance of a term weighting scheme often varies across text categorization tasks, yet the mechanism underlying this variability remains unclear. Moreover, existing schemes often weight a term with respect to a category locally, without considering the global distribution of the term's occurrences across all categories in a corpus. In this paper, we first systematically examine the pros and cons of existing term weighting schemes for text categorization and explore why some schemes with sound theoretical bases, such as the chi-square test and information gain, perform poorly in empirical evaluations. By measuring how concentrated a term's occurrences are across all categories in a corpus, we then propose a series of entropy-based term weighting schemes that quantify a term's distinguishing power in text categorization. Extensive experiments on five different datasets show that the proposed schemes consistently outperform state-of-the-art term weighting schemes. Moreover, our findings shed new light on how to choose and develop an effective term weighting scheme for a specific text categorization task.
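To make the idea concrete, the sketch below illustrates the kind of entropy-based global weight the abstract describes: a term whose occurrences concentrate in one category has low entropy across categories and thus high distinguishing power, while a term spread uniformly over all categories has high entropy and low power. This is only an illustrative sketch under assumptions; the 1 − normalized-entropy form, the combination with raw term frequency, and all function names are chosen here for illustration and are not the paper's exact formulation.

```python
# Illustrative sketch only; the paper's exact entropy-based schemes are defined
# in the full text. Shown here: a generic "concentration" weight per term.
import math
from collections import Counter, defaultdict


def entropy_weights(docs, labels):
    """docs: list of token lists; labels: parallel list of category labels.

    Returns {term: weight in [0, 1]}, where 1 means the term occurs in only
    one category and 0 means it is spread uniformly over all categories.
    """
    categories = set(labels)
    max_entropy = math.log(len(categories)) if len(categories) > 1 else 1.0

    # Count each term's occurrences per category.
    term_cat_counts = defaultdict(Counter)
    for tokens, cat in zip(docs, labels):
        for tok in tokens:
            term_cat_counts[tok][cat] += 1

    weights = {}
    for term, cat_counts in term_cat_counts.items():
        total = sum(cat_counts.values())
        probs = [c / total for c in cat_counts.values()]
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        # Assumed form for illustration: 1 - normalized entropy.
        weights[term] = 1.0 - entropy / max_entropy
    return weights


def weight_document(tokens, global_weights):
    """Combine raw term frequency with the global entropy-based factor,
    in the spirit of tf x global-weight schemes such as tf-idf."""
    tf = Counter(tokens)
    return {t: tf[t] * global_weights.get(t, 0.0) for t in tf}


if __name__ == "__main__":
    docs = [["stock", "market", "shares"], ["match", "goal", "team", "market"]]
    labels = ["finance", "sports"]
    gw = entropy_weights(docs, labels)
    print(weight_document(["market", "goal", "goal"], gw))
    # "market" appears in both categories -> weight 0; "goal" only in sports -> weight 1.
```

In this toy run the category-neutral term "market" is zeroed out while the category-specific term "goal" keeps its full term frequency, which is the concentration intuition the proposed schemes build on; the paper's actual schemes refine and normalize this global factor in scheme-specific ways.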


Metadata
Title
On entropy-based term weighting schemes for text categorization
Authors
Tao Wang
Yi Cai
Ho-fung Leung
Raymond Y. K. Lau
Haoran Xie
Qing Li
Publication date
07.07.2021
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 9/2021
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-021-01581-5
