Published in: Neural Computing and Applications 8/2019

06-01-2018 | Original Article

Term weighting scheme for short-text classification: Twitter corpuses

Authors: Issa Alsmadi, Gan Keng Hoon

Abstract

Term weighting is a well-known preprocessing step in text classification that assigns a weight to each term in every document in order to improve classification performance. Most methods proposed in the literature are traditional approaches that emphasize term frequency. These methods perform reasonably well on traditional documents, but they are unsuitable for social network data, where texts are short and characterized by sparsity and noise. This study introduces a simple supervised term weighting approach, SW, which accounts for the special nature of short texts by using term strength and term distribution. Its effect in a high-dimensional vector space is assessed against the term weighting schemes that serve as baselines in traditional text classification. Two datasets are used with support vector machine, decision tree, k-nearest neighbor, and logistic regression algorithms. The first, the Sanders dataset, is a benchmark dataset of approximately 5000 tweets in four categories. The second, a self-collected dataset, contains roughly 1500 tweets in six classes gathered through the Twitter API. The evaluation applies tenfold cross-validation on the labeled data to compare the proposed approach with state-of-the-art methods. The experimental results indicate that the supervised approaches vary in performance but are predominantly better than the unsupervised approaches, and that the proposed approach, SW, outperforms the others in terms of accuracy. SW can deal with the limitations of short texts and mitigate the limitations of traditional approaches in the literature, improving performance to 80.83 and 90.64 (F-measure) on the Sanders dataset and the self-collected dataset, respectively.
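
The contrast between unsupervised and supervised term weighting can be illustrated with a minimal sketch. The abstract does not give the SW formula, so the code below does not implement SW. As a stand-in it uses relevance frequency (rf), a known supervised weighting from the literature, next to plain IDF. The toy tweets, labels, and function names are hypothetical and are not taken from the paper's datasets.

```python
import math
from collections import Counter

# Hypothetical toy corpus of labelled short texts (not the paper's data).
DOCS = [
    ("apple releases new iphone today", "tech"),
    ("google updates android phone software", "tech"),
    ("new phone battery lasts longer", "tech"),
    ("team wins the match today", "sport"),
    ("coach praises team after match", "sport"),
    ("new stadium opens for the team", "sport"),
]


def tokenize(text):
    """Lowercase whitespace tokenizer; real tweets would need more cleaning."""
    return text.lower().split()


def idf(term, docs):
    """Unsupervised weight: log(N / df). Class labels are ignored."""
    df = sum(1 for text, _ in docs if term in tokenize(text))
    return math.log(len(docs) / df)


def rf(term, docs, positive):
    """Supervised weight (relevance frequency): log2(2 + a / max(1, c)).

    a = documents of the positive class containing the term
    c = documents of all other classes containing the term
    This is a stand-in for illustration, NOT the SW scheme of the paper.
    """
    a = sum(1 for text, label in docs
            if label == positive and term in tokenize(text))
    c = sum(1 for text, label in docs
            if label != positive and term in tokenize(text))
    return math.log2(2 + a / max(1, c))


def weight_doc(text, docs, positive):
    """Return {term: (tf * idf, tf * rf)} for one short text."""
    tf = Counter(tokenize(text))
    return {t: (n * idf(t, docs), n * rf(t, docs, positive))
            for t, n in tf.items()}


if __name__ == "__main__":
    for term in ("new", "team"):
        print(term, round(idf(term, DOCS), 3), round(rf(term, DOCS, "sport"), 3))
```

In this toy corpus "new" and "team" each occur in three of the six texts, so IDF gives them the same weight. The supervised weight separates them, because "team" occurs only in the "sport" class while "new" is spread across both classes. This is the kind of class-distribution signal that supervised schemes exploit and that frequency-only schemes cannot see.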

Metadata
Title
Term weighting scheme for short-text classification: Twitter corpuses
Authors
Issa Alsmadi
Gan Keng Hoon
Publication date
06-01-2018
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 8/2019
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-017-3298-8