Neural Computing and Applications

06-01-2018 | Original Article

Term weighting scheme for short-text classification: Twitter corpuses

Authors: Issa Alsmadi, Gan Keng Hoon

Published in: Neural Computing and Applications | Issue 8/2019

Abstract

Term weighting is a well-known preprocessing step in text classification that assigns an appropriate weight to each term in every document to improve classification performance. Most methods proposed in the literature use traditional approaches that emphasize term frequency. These methods perform reasonably well on traditional documents. However, they are unsuitable for social network data, where texts are limited in length and sparsity and noise are characteristic. This study introduces a simple supervised term weighting approach, SW, which accounts for the special nature of short texts based on term strength and term distribution, and assesses its effect in a high-dimensional vector space against the baseline term weighting schemes of traditional text classification. Two datasets are used with support vector machine, decision tree, k-nearest neighbor, and logistic regression algorithms. The first, the Sanders dataset, is a benchmark dataset of approximately 5000 tweets in four categories. The second, a self-collected dataset, contains roughly 1500 tweets in six classes collected using the Twitter API. The evaluation applied tenfold cross-validation on the labeled data to compare the proposed approach with state-of-the-art methods. The experimental results indicate that the supervised approaches vary in performance but predominantly outperform the unsupervised approaches. The proposed approach, SW, achieves better accuracy than the other schemes. SW can deal with the limitations of short texts and mitigate the limitations of traditional approaches in the literature, improving performance to 80.83 and 90.64 (F-measure) on the Sanders dataset and the self-collected dataset, respectively.
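To make the setup in the abstract concrete, the following is a minimal sketch of a frequency-based baseline weighting and a tenfold split. It is not the SW scheme, whose formula the abstract does not give. It assumes TF-IDF (tf × log(N/df)) as a representative unsupervised baseline; the abstract does not name its baselines. The function names `tfidf_weights` and `kfold_indices` and the toy tweets are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = raw term frequency * log(N / document frequency).
    This is an assumed baseline, not the SW scheme from the article.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def kfold_indices(n_items, k=10):
    """Split range(n_items) into k contiguous (train, test) index folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


# Invented toy tweets, already tokenized by whitespace.
toy_tweets = [
    "apple releases new iphone".split(),
    "google updates android today".split(),
    "new android phone from google".split(),
]
weights = tfidf_weights(toy_tweets)
```

In short tweets most terms occur once, so the term-frequency factor is nearly constant and the weight is driven almost entirely by document frequency. This is one way to see why the abstract argues that frequency-centered schemes carry little signal on short text. In a real evaluation the folds would normally be shuffled and stratified by class before training each classifier.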

Title: Term weighting scheme for short-text classification: Twitter corpuses
Authors: Issa Alsmadi, Gan Keng Hoon
Publication date: 06-01-2018
Publisher: Springer London
Published in: Neural Computing and Applications / Issue 8/2019
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI: https://doi.org/10.1007/s00521-017-3298-8
