Sentiment analysis measures the polarity or tonality of texts by identifying and assessing expressions people use to evaluate or appraise persons, entities or events (Pang and Lee 2008; Liu 2012; Soroka 2014; Mohammad 2016). Analyzing the polarity of texts has a long tradition in the social sciences. A prominent example is media negativity, a concept that captures the over-selection of negative over positive news, the tonality of media stories, and the degree of conflict or confrontation in the news (Esser and Strömbäck 2012). Its “measurement in quantitative content analytic research can be defined as the process of linking certain aspects of textual data to numerical values that represent the presence, intensity, and frequency of textual aspects relevant to communication research” (Lengauer et al. 2012, p. 183). A number of recent studies demonstrate the benefits of sentiment analysis for such analyses (Van Atteveldt et al. 2008; Soroka 2012; Young and Soroka 2012; Burscher et al. 2015; Soroka et al. 2015a, b). Sentiment analysis has also been used to establish the level of support for legislative proposals or polarization from the analysis of parliamentary debates (Monroe et al. 2008), to identify issue positions or public opinion in online debates (Hopkins and King 2010; Ceron et al. 2012; González-Bailón and Paltoglou 2015), and to study negative campaigning (Kahn and Kenney 2004; Lau and Pomper 2004; Geer 2006; Nai and Walter 2015), to mention just a few prominent uses. The classification of texts as positive, negative, or neutral is denoted by expressions such as polarity, valence or tone (Wilson et al. 2005; Young and Soroka 2012; Thelwall and Buckley 2013; González-Bailón and Paltoglou 2015; Mohammad 2016). An incomplete list of terms for the gradual or quantitative measurement of sentiment includes potency (Osgood et al. 1957), intensity, sentiment strength (e.g. Thelwall et al. 2010) and emotive force (Macagno and Walton 2014). We will use sentiment strength and tonality as synonymous terms for a fine-grained measure of negativity. We cover only the neutral-to-negative part of the sentiment scale, as psychological research has highlighted asymmetries between positive and negative evaluations of situations, persons or events (Peeters 1971; Peeters and Czapinski 1990; Cacioppo and Berntson 1994; Baumeister et al. 2001; Rozin and Royzman 2001). We also do not probe into different ‘negative’ emotions (Ekman 1992), nor do we look at causes of negative evaluations (Soroka 2014; Soroka et al. 2015a).
The field of sentiment analysis is dominated by computer-based, automated approaches whose progress varies strongly by language (Mohammad 2016). Many social scientists are still more familiar with human-based content analyses, with or without dictionaries (Stone et al. 1966; Budge and Farlie 1983; Baumgartner and Jones 1993; Laver and Garry 2000; Young and Soroka 2012; Krippendorff 2013). Both manual and automated text analysis require an initial step of coding (or annotating or labelling) the sentiment of a text unit. Supervised and non-supervised automated approaches employ sample texts with coded sentiment ratings to ‘learn’ the sentiment of words. Once that phase of the research process, which usually includes a considerable amount of ‘fine-tuning’ the procedure, has concluded, the algorithms scale to large text corpora. Manual coding, in contrast, does not scale well, as human coders often have to rate small units of text such as sentences or words. Compared to unit-by-unit hand coding, creating and using a dictionary of pre-coded words is a big step towards higher efficiency. An automated search can then establish whether a new text unit contains a dictionary word and retrieve its sentiment value.
A basic assumption of using a dictionary is that it contains the most important words required for rating a text. A recent comparison of English-language dictionaries and machine-learning approaches found that “dictionaries had exceptional precision, but very low recall, suggesting that the method can be accurate, but that current lexicons are lacking scope. Machine learning systems worked in the opposite manner, exhibiting greater coverage but more error” (Soroka et al. 2015a, p. 112). A large dictionary can provide good scope, but dictionary size on its own is a misleading indicator of output quality, as irrelevant vocabulary produces less discriminating sentiment scores (González-Bailón and Paltoglou 2015).
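Since the quoted comparison turns on precision and recall, a small worked example may help. The labels below are invented; the pattern of few but accurate dictionary hits mirrors the ‘high precision, low recall’ finding only schematically:

```python
# Illustrative precision/recall computation for negative-sentiment detection.
# Hypothetical labels: 1 = negative text unit, 0 = not negative.
gold      = [1, 1, 1, 1, 0, 0, 0, 0]   # human-coded reference
predicted = [1, 0, 0, 0, 0, 0, 0, 0]   # sparse dictionary: few hits, all correct

tp = sum(g == p == 1 for g, p in zip(gold, predicted))        # true positives
fp = sum(g == 0 and p == 1 for g, p in zip(gold, predicted))  # false positives
fn = sum(g == 1 and p == 0 for g, p in zip(gold, predicted))  # false negatives

precision = tp / (tp + fp)   # 1.0: every flagged unit is truly negative
recall    = tp / (tp + fn)   # 0.25: most negative units are missed
print(precision, recall)
```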
Related is the problem of domain specificity. Sentiment scores of words extracted from a training set of annotated texts do not generalize well to texts from other domains. Social scientists have accordingly stressed the need for custom-made dictionaries (Loughran and McDonald 2011; Young and Soroka 2012; Grimmer and Stewart 2013; González-Bailón and Paltoglou 2015; Soroka et al. 2015a). Even some commercial providers advise against using a sentiment dictionary ‘as is’ without thorough customization.
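One simple form of such customization is to overlay domain-specific scores on a general-purpose lexicon. The entries below are invented for illustration and do not reflect any actual dictionary:

```python
# Sketch of overlaying domain-specific scores on a general lexicon.
# All entries and values are invented for illustration.
base_lexicon = {"attack": -0.8, "win": 0.6, "cancer": -0.9}

# In sports coverage, 'attack' is usually descriptive rather than evaluative.
sports_overrides = {"attack": 0.0}

sports_lexicon = {**base_lexicon, **sports_overrides}
print(sports_lexicon["attack"])  # 0.0 in the sports domain, -0.8 elsewhere
```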
We have pointed out that creating a customized dictionary or setting up a sample of training texts for machine learning requires an initial step of human coding, which becomes a procedural bottleneck if unit-by-unit sentiment coding has to be done by a small number of coders. We mitigate this bottleneck through crowdcoding, which offers a cheap and fast way to collect annotations for large amounts of text.
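As a generic illustration of how multiple crowd judgments per text unit can be condensed into a single tonality score, consider the following sketch. The rating scale, the data, and the mean-aggregation rule are illustrative assumptions, not the procedure of this paper:

```python
# Generic sketch of aggregating crowdcoded sentiment ratings.
# Hypothetical scale: 0 (neutral) to -3 (very negative); data invented.
from statistics import mean

crowd_ratings = {
    "sent_01": [0, -1, 0, -1, 0],      # five crowd coders per sentence
    "sent_02": [-3, -2, -3, -3, -2],
}

# One common aggregation: the mean rating per sentence as its tonality score.
tonality = {sid: mean(ratings) for sid, ratings in crowd_ratings.items()}
print(tonality)  # {'sent_01': -0.4, 'sent_02': -2.6}
```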