Social Network Analysis and Mining

01.12.2016 | Original Article

From classification to quantification in tweet sentiment analysis

Authors: Wei Gao, Fabrizio Sebastiani

Published in: Social Network Analysis and Mining | Issue 1/2016

Abstract

Sentiment classification has become a ubiquitous enabling technology in the Twittersphere, since classifying tweets according to the sentiment they convey towards a given entity (be it a product, a person, a political party, or a policy) has many applications in political science, social science, market research, and many others. In this paper, we contend that most previous studies dealing with tweet sentiment classification (TSC) use a suboptimal approach. The reason is that the final goal of most such studies is not estimating the class label (e.g., Positive, Negative, or Neutral) of individual tweets, but estimating the relative frequency (a.k.a. “prevalence”) of the different classes in the dataset. The latter task is called quantification, and recent research has convincingly shown that it should be tackled as a task of its own, using learning algorithms and evaluation measures different from those used for classification. In this paper, we show (by carrying out experiments using two learners, seven quantification-specific algorithms, and 11 TSC datasets) that using quantification-specific algorithms produces substantially better class frequency estimates than a state-of-the-art classification-oriented algorithm routinely used in TSC. We thus argue that researchers interested in tweet sentiment prevalence should switch to quantification-specific (instead of classification-specific) learning algorithms and evaluation measures.
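The distinction between classifying individual tweets and estimating class prevalence can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the toy labels, the two hypothetical classifiers, and the smoothing constant `eps` are not taken from the paper, and the prevalence estimate shown is simply the "classify and count" baseline rather than any of the quantification-specific algorithms the abstract refers to. The sketch only shows that a classifier with higher accuracy can still produce a worse prevalence estimate, which is why the two tasks call for different evaluation measures.

```python
import math
from collections import Counter

CLASSES = ("Positive", "Neutral", "Negative")


def accuracy(true_labels, predicted_labels):
    """Fraction of items whose predicted label matches the true label."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


def prevalence(labels, classes=CLASSES):
    """Relative frequency of each class in a list of labels."""
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in classes}


def kld(true_prev, est_prev, eps=1e-9):
    """Kullback-Leibler divergence between true and estimated prevalences.

    Additive smoothing with eps (an assumed value) avoids log(0) when a
    class has zero estimated prevalence.
    """
    k = len(true_prev)
    total = 0.0
    for c in true_prev:
        p = (true_prev[c] + eps) / (1.0 + eps * k)
        q = (est_prev[c] + eps) / (1.0 + eps * k)
        total += p * math.log(p / q)
    return total


# Toy test set: 5 Positive, 3 Neutral, 2 Negative tweets (made-up data).
true = ["Positive"] * 5 + ["Neutral"] * 3 + ["Negative"] * 2

# Hypothetical classifier A: few errors, all in the same direction.
pred_a = ["Positive"] * 3 + ["Neutral"] * 2 + ["Neutral"] * 3 + ["Negative"] * 2

# Hypothetical classifier B: more errors, but they cancel out per class.
pred_b = (["Positive"] * 4 + ["Neutral"]
          + ["Positive", "Neutral", "Negative"]
          + ["Negative", "Neutral"])

for name, pred in (("A", pred_a), ("B", pred_b)):
    print(name, accuracy(true, pred), prevalence(pred),
          kld(prevalence(true), prevalence(pred)))
```

In this toy setting classifier A is the more accurate of the two, yet its errors all push the Positive estimate down, while the errors of classifier B cancel out and leave the estimated class distribution equal to the true one.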

Consistent with most mathematical literature, we use the caret symbol (\(\wedge\)) to indicate estimation.

Since the standard logistic function \(\frac{e^{x}}{e^{x}+1}\) ranges (for the domain \([0,+\infty )\) we are interested in) on [\(\frac{1}{2}\),1], we multiply by 2 in order for it to range on [1,2], and subtract 1 in order for it to range on [0,1], as desired.
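The rescaling described in this note can be checked numerically. The sketch below is only an illustration under assumptions: the function name `rescaled_logistic` is hypothetical, and the input `x` stands for whatever non-negative quantity is being normalized. Algebraically, \(2\cdot \frac{e^{x}}{e^{x}+1}-1 = \frac{e^{x}-1}{e^{x}+1}\), which is 0 at \(x=0\) and approaches 1 as \(x\) grows.

```python
import math


def rescaled_logistic(x: float) -> float:
    """Map x in [0, +inf) to [0, 1) via 2 * e^x / (e^x + 1) - 1."""
    if x < 0:
        raise ValueError("x is assumed to be non-negative")
    # e^x / (e^x + 1) is written as 1 / (1 + e^-x) to avoid overflow for large x.
    logistic = 1.0 / (1.0 + math.exp(-x))
    return 2.0 * logistic - 1.0


for x in (0.0, 0.1, 1.0, 5.0, 50.0):
    print(x, rescaled_logistic(x))
```

The function increases monotonically with `x`, so rescaling preserves the ordering of the original values.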

http://www.ark.cs.cmu.edu/TweetNLP/.

In Joachims (2005), SVM-perf is actually called SVM-multi, but the author has released its implementation under the name SVM-perf; we will thus use this latter name.

SVM-perf is available from http://svmlight.joachims.org/svm_struct.html, while the module that customizes it to \({{\mathrm{KLD}}}\) is available from http://hlt.isti.cnr.it/quantification/. The code for all the other methods discussed in this section is available from http://alt.qcri.org/~wgao/codes/tweet_sentiment_quantification.zip.

This means that we avoid TSC datasets in which the labels are automatically derived from, say, the emoticons present in the tweets.

In order to enhance the reproducibility of our experimental results, we make available (at http://alt.qcri.org/~wgao/data/SNAM/tweet_sentiment_quantification.zip) the vectorial representations we have generated for all the datasets (split into training / validation / test sets) used in this paper.

The SVM-based implementation of CC is called SVM(HL) in Gao and Sebastiani (2015). LIBSVM is available from http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

At the time of writing this paper, the test set of the SemEval2016 collection has not yet been made available. However, the data made available by the organizers were already pre-split into three subsets, called “train”, “dev”, and “devtest”; we have thus used these subsets as the training set, held-out set, and test set, respectively.

http://www.csie.ntu.edu.tw/~cjlin/liblinear/.

Alaíz-Rodríguez R, Guerrero-Curieses A, Cid-Sueiro J (2011) Class and subclass probability re-estimation to adapt a classifier in the presence of concept drift. Neurocomputing 74(16):2614–2623

Asur S, Huberman BA (2010) Predicting the future with social media. In: Proceedings of the 10th IEEE/WIC/ACM international conference on web intelligence (WI 2010), pp 492–499, Toronto, CA

Balikas G, Partalas I, Gaussier E, Babbar R, Amini M-R (2015) Efficient model selection for regularized classification by exploiting unlabeled data. In: Proceedings of the 14th international symposium on intelligent data analysis (IDA 2015), pp 25–36, Saint Etienne, FR

Barranquero J, González P, Díez J, del Coz JJ (2013) On the study of nearest neighbor algorithms for prevalence estimation in binary problems. Pattern Recognit 46(2):472–482

Barranquero J, Díez J, del Coz JJ (2015) Quantification-oriented learning based on reliable classifiers. Pattern Recognit 48(2):591–604

Beijbom O, Hoffman J, Yao E, Darrell T, Rodriguez-Ramirez A, Gonzalez-Rivero M, Hoegh-Guldberg O (2015) Quantification in-the-wild: Data-sets and baselines. Presented at the NIPS 2015 Workshop on Transfer and Multi-Task Learning. Montreal, CA

Bella A, Ferri C, Hernández-Orallo J, Ramírez-Quintana MJ (2010) Quantification via probability estimators. In: Proceedings of the 11th IEEE international conference on data mining (ICDM 2010), pp 737–742, Sydney, AU

Berardi G, Esuli A, Sebastiani F (2015) Utility-theoretic ranking for semi-automated text classification. ACM Trans Knowl Discov Data 10(1). Article 6

Bollen J, Mao H, Zeng X-J (2011) Twitter mood predicts the stock market. J Comput Sci 2(1):1–8

Borge-Holthoefer J, Magdy W, Darwish K, Weber I (2015) Content and network dynamics behind Egyptian political polarization on Twitter. In: Proceedings of the 18th ACM conference on computer supported cooperative work and social computing (CSCW 2015), pp 700–711, Vancouver, CA

Burton S, Soboleva A (2011) Interactive or reactive? Marketing with Twitter. J Consumer Mark 28(7):491–499

Chan YS, Ng HT (2006) Estimating class priors in domain adaptation for word sense disambiguation. In: Proceedings of the 44th annual meeting of the Association for Computational Linguistics (ACL 2006), pp 89–96, Sydney, AU

Chang C-C, Lin C-J (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3). Article 27

Conroy BR, Sajda P (2012) Fast, exact model selection and permutation testing for L2-regularized logistic regression. In: Proceedings of the 15th international conference on artificial intelligence and statistics (AISTATS 2012), pp 246–254, La Palma, ES

Cover TM, Thomas JA (1991) Elements of information theory. Wiley, New York

Csiszár I, Shields PC (2004) Information theory and statistics: a tutorial. Found Trends Commun Inf Theory 1(4):417–528

Da San Martino G, Gao W, Sebastiani F (2016) QCRI at SemEval-2016 Task 4: probabilistic methods for binary and ordinal quantification. In: Proceedings of the 10th international workshop on semantic evaluation (SemEval 2016), San Diego, US (forthcoming)

Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39(1):1–38

Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

Dodds PS, Harris KD, Kloumann IM, Bliss CA, Danforth CM (2011) Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter. PLoS One 6(12):e26752

Esuli A (2016) ISTI-CNR at SemEval-2016 Task 4: quantification on an ordinal scale. In: Proceedings of the 10th international workshop on semantic evaluation (SemEval 2016), San Diego, US

Esuli A, Sebastiani F (2010) Sentiment quantification. IEEE Intell Syst 25(4):72–75

Esuli A, Sebastiani F (2014) Explicit loss minimization in quantification applications (preliminary draft). In: Presented at the 8th international workshop on information filtering and retrieval (DART 2014), Pisa, IT

Esuli A, Sebastiani F (2015) Optimizing text quantifiers for multivariate loss functions. ACM Trans Knowl Discov Data 9(4). Article 27

Fan R-E, Chang K-W, Hsieh C-J, Wang X-R, Lin C-J (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871–1874

Forman G (2005) Counting positives accurately despite inaccurate classification. In: Proceedings of the 16th European Conference on machine learning (ECML 2005), pp 564–575, Porto, PT

Forman G (2008) Quantifying counts and costs via classification. Data Min Knowl Discov 17(2):164–206

Gao W, Sebastiani F (2015) Tweet sentiment: from classification to quantification. In: Proceedings of the 7th international conference on advances in social network analysis and mining (ASONAM 2015), pp 97–104, Paris, FR

González-Castro V, Alaiz-Rodríguez R, Alegre E (2013) Class distribution estimation based on the Hellinger distance. Inf Sci 218:146–164

Herfort B, Schelhorn S-J, de Albuquerque JP, Zipf A (2014) Does the spatiotemporal distribution of tweets match the spatiotemporal distribution of flood phenomena? A study about the river Elbe flood in June 2013. In: Proceedings of the 11th international conference on information systems for crisis response and management (ISCRAM 2014), pp 747–751, Philadelphia, US

Hopkins DJ, King G (2010) A method of automated nonparametric content analysis for social science. Am J Political Sci 54(1):229–247

Joachims T (2005) A support vector method for multivariate performance measures. In: Proceedings of the 22nd international conference on machine learning (ICML 2005), pp 377–384, Bonn, DE

Joachims T, Hofmann T, Yue Y, Yu C-N (2009) Predicting structured objects with support vector machines. Commun ACM 52(11):97–104

Kaya M, Fidan G, Toroslu IH (2013) Transfer learning using Twitter data for improving sentiment classification of Turkish political news. In: Proceedings of the 28th international symposium on computer and information sciences (ISCIS 2013), pp 139–148, Paris, FR

King G, Lu Y (2008) Verbal autopsy methods with multiple causes of death. Stat Sci 23(1):78–91

Kiritchenko S, Zhu X, Mohammad SM (2014) Sentiment analysis of short informal texts. J Artif Intell Res 50:723–762

Latinne P, Saerens M, Decaestecker C (2001) Adjusting the outputs of a classifier to new a priori probabilities may significantly improve classification accuracy: evidence from a multi-class problem in remote sensing. In: Proceedings of the 18th international conference on machine learning (ICML 2001), pp 298–305

Lewis DD (1995) Evaluating and optimizing autonomous text classification systems. In: Proceedings of the 18th ACM international conference on research and development in information retrieval (SIGIR 1995), pp 246–254, Seattle, US

Limsetto N, Waiyamai K (2011) Handling concept drift via ensemble and class distribution estimation technique. In: Proceedings of the 7th international conference on advanced data mining (ADMA 2011), pp 13–26, Beijing, CN

Marchetti-Bowick M, Chambers N (2012) Learning for microblogs with distant supervision: political forecasting with Twitter. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), pp 603–612, Avignon, FR

Martínez-Cámara E, Martín-Valdivia MT, López LAU, Ráez AM (2014) Sentiment analysis in Twitter. Nat Lang Eng 20(1):1–28

Mejova Y, Weber I, Macy MW (eds) (2015) Twitter: a digital socioscope. Cambridge University Press, Cambridge

Milli L, Monreale A, Rossetti G, Giannotti F, Pedreschi D, Sebastiani F (2013) Quantification trees. In: Proceedings of the 13th IEEE international conference on data mining (ICDM 2013), pp 528–536, Dallas, US

Mohammad SM, Kiritchenko S, Zhu X (2013) NRC-Canada: building the state-of-the-art in sentiment analysis of tweets. In: Proceedings of the 7th international workshop on semantic evaluation (SemEval 2013), pp 321–327, Atlanta, US

Murphy KP (2012) Machine learning. A probabilistic perspective. The MIT Press, Cambridge

Nakov P, Rosenthal S, Kozareva Z, Stoyanov V, Ritter A, Wilson T (2013) SemEval-2013 Task 2: sentiment analysis in Twitter. In: Proceedings of the 7th international workshop on semantic evaluation (SemEval 2013), pp 312–320, Atlanta, US

Nakov P, Ritter A, Rosenthal S, Sebastiani F, Stoyanov V (2016) SemEval-2016 Task 4: sentiment analysis in Twitter. In: Proceedings of the 10th international workshop on semantic evaluation (SemEval 2016), San Diego, US (forthcoming)

Narasimhan H, Li S, Kar P, Chawla S, Sebastiani F (2016) Stochastic optimization techniques for quantification performance measures. Submitted for publication

O’Connor B, Balasubramanyan R, Routledge BR, Smith NA (2010) From tweets to polls: linking text sentiment to public opinion time series. In: Proceedings of the 4th AAAI Conference on Weblogs and Social Media (ICWSM 2010), Washington, US

Olteanu A, Vieweg S, Castillo C (2015) What to expect when the unexpected happens: social media communications across crises. In: Proceedings of the 18th ACM conference on computer supported cooperative work and social computing (CSCW 2015), pp 994–1009, Vancouver, CA

Pan W, Zhong E, Yang Q (2012) Transfer learning for text mining. In: Aggarwal CC, Zhai CX (eds) Mining text data. Springer, Heidelberg, pp 223–258

Qureshi MA, O’Riordan C, Pasi G (2013) Clustering with error estimation for monitoring reputation of companies on Twitter. In: Proceedings of the 9th Asia Information Retrieval Societies Conference (AIRS 2013), pp 170–180, Singapore, SG

Rosenthal S, Nakov P, Kiritchenko S, Mohammad S, Ritter A, Stoyanov V (2015) SemEval-2015 Task 10: sentiment analysis in Twitter. In: Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015), pp 451–463, Denver, US

Rosenthal S, Ritter A, Nakov P, Stoyanov V (2014) SemEval-2014 Task 9: sentiment analysis in Twitter. In: Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014), pp 73–80, Dublin, IE

Saerens M, Latinne P, Decaestecker C (2002) Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Comput 14(1):21–41

Saif H, Fernández M, He Y, Alani H (2013) Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold. In: Proceedings of the 1st international workshop on emotion and sentiment in social and expressive media (ESSEM 2013), pp 9–21, Torino, IT

Sánchez L, González V, Alegre E, Alaiz R (2008) Classification and quantification based on image analysis for sperm samples with uncertain damaged/intact cell proportions. In: Proceedings of the 5th international conference on image analysis and recognition (ICIAR 2008), pp 827–836, Póvoa de Varzim, PT

Takahashi T, Abe S, Igata N (2011) Can Twitter be an alternative of real-world sensors? In: Proceedings of the 14th international conference on human–computer interaction (HCI International 2011), pp 240–249, Orlando, US

Tang L, Gao H, Liu H (2010) Network quantification despite biased labels. In: Proceedings of the 8th workshop on mining and learning with graphs (MLG 2010), pp 147–154, Washington, US

Tsochantaridis I, Joachims T, Hofmann T, Altun Y (2005) Large margin methods for structured and interdependent output variables. J Mach Learn Res 6:1453–1484

Vapnik V (1998) Statistical learning theory. Wiley, New York

Wilcoxon F (1945) Individual comparisons by ranking methods. Biom Bull 1(6):80–83

Wu T-F, Lin C-J, Weng RC (2004) Probability estimates for multi-class classification by pairwise coupling. J Mach Learn Res 5:975–1005

Xue JC, Weiss GM (2009) Quantification and semi-supervised classification methods for handling changes in class distribution. In: Proceedings of the 15th ACM international conference on knowledge discovery and data mining (SIGKDD 2009), pp 897–906, Paris, FR

Zhang Z, Zhou J (2010) Transfer estimation of evolving class priors in data stream classification. Pattern Recognit 43(9):3151–3161

Zhu X, Kiritchenko S, Mohammad SM (2014) NRC-Canada-2014: recent improvements in the sentiment analysis of tweets. In: Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014), pp 443–447, Dublin, IE

Zou F, Wang Y, Yang Y, Zhou K, Chen Y, Song J (2015) Supervised feature learning via L2-norm regularized logistic regression for 3D object recognition. Neurocomputing 151:603–611

Title: From classification to quantification in tweet sentiment analysis
Authors: Wei Gao, Fabrizio Sebastiani
Publication date: 01.12.2016
Publisher: Springer Vienna
Published in: Social Network Analysis and Mining / Issue 1/2016
Print ISSN: 1869-5450
Electronic ISSN: 1869-5469
DOI: https://doi.org/10.1007/s13278-016-0327-z
