1 Introduction
- RQ1: To what extent do different sentiment analysis tools agree with emotions of software developers?
- RQ2: To what extent do results from different sentiment analysis tools agree with each other?
- RQ3: Do different sentiment analysis tools lead to contradictory results in a software engineering study?
- RQ4: How does the choice of a sentiment analysis tool affect validity of the previously published results?
2 Sentiment Analysis Tools
2.1 Tool Selection
2.2 Description of Tools
2.2.1 SentiStrength
2.2.2 Alchemy
2.2.3 NLTK
2.2.4 Stanford NLP
3 Agreement Between Sentiment Analysis Tools
3.1 Methodology
3.1.1 Manually-Labeled Software Engineering Data
3.1.2 Evaluation Metrics
 | positive | neutral | negative |
---|---|---|---|
positive | 0 | 1 | 2 |
neutral | 1 | 0 | 1 |
negative | 2 | 1 | 0 |
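The weights in the table above can be plugged into a weighted κ computation. The following is a minimal sketch, assuming the table gives disagreement weights (0 for identical labels, 2 for positive vs. negative) and that κ is computed as one minus the ratio of observed to chance-expected weighted disagreement; the function name `weighted_kappa` and the label ordering are illustrative choices, not taken from the study's code.

```python
from collections import Counter

# Ordering chosen so that |i - j| reproduces the weight table:
# positive/neutral = 1, neutral/negative = 1, positive/negative = 2.
LABELS = ["positive", "neutral", "negative"]


def weighted_kappa(rater_a, rater_b):
    """Weighted Cohen's kappa for two equally long label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint label counts
    marginal_a = Counter(rater_a)
    marginal_b = Counter(rater_b)

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i, label_a in enumerate(LABELS):
        for j, label_b in enumerate(LABELS):
            weight = abs(i - j)
            observed_disagreement += weight * observed[(label_a, label_b)] / n
            expected_disagreement += (
                weight * (marginal_a[label_a] / n) * (marginal_b[label_b] / n)
            )
    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return 1.0 - observed_disagreement / expected_disagreement
```

With these weights, confusing positive with negative is penalized twice as much as confusing either of them with neutral.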
3.2 Results
Tools | κ | ARI | F (neu) | F (pos) | F (neg) |
---|---|---|---|---|---|
NLTK vs. manual | 0.33 | 0.21 | 0.76 | 0.53 | 0.31 |
SentiStrength vs. manual | 0.31 | 0.13 | 0.73 | 0.47 | 0.35 |
Alchemy vs. manual | 0.26 | 0.07 | 0.53 | 0.54 | 0.23 |
Stanford NLP vs. manual | 0.20 | 0.11 | 0.48 | 0.53 | 0.20 |
NLTK vs. SentiStrength | 0.22 | 0.08 | 0.64 | 0.45 | 0.33 |
NLTK vs. Alchemy | 0.20 | 0.09 | 0.52 | 0.60 | 0.44 |
NLTK vs. Stanford NLP | 0.12 | 0.05 | 0.48 | 0.42 | 0.47 |
SentiStrength vs. Alchemy | 0.07 | 0.07 | 0.56 | 0.55 | 0.38 |
SentiStrength vs. Stanford NLP | −0.14 | 0.00 | 0.51 | 0.33 | 0.35 |
Alchemy vs. Stanford NLP | 0.25 | 0.05 | 0.41 | 0.43 | 0.58 |
3.3 Discussion
“To test this I used an aggregate AE with a CAS multiplier that declared getCasInstancesRequired()=5. If this AE is instantiated and run in a loop with earlier code it eats up roughly 10MB per iteration. No such leak with the latest code. Thanks!”
Due to the presence of an expression of gratitude, the comment has been labeled as “love” by all four participants of Murgia’s study. We interpret this as a clear indication of positive sentiment. However, none of the tools is capable of recognizing this: SentiStrength labels the comment as neutral, while NLTK, Alchemy and Stanford NLP label it as negative. Indeed, Stanford NLP, for instance, believes the first three sentences to be negative (e.g., due to the presence of “No”), and while it correctly recognizes the last sentence as positive, this is not enough to change the evaluation of the comment as a whole.
“D.E. Veloper committed your patch for Xerces 2.6.0. Please verify.”
Three out of four annotators do not recognize the presence of emotion in this comment, and we interpret this as the comment being neutral. However, keyword-based sentiment analysis tools might wrongly identify the presence of sentiment. For instance, in SentiWordNet (Baccianella et al. 2010) the verb “commit”, in addition to neutral meanings (e.g., perpetrate an act, as in “commit a crime”), has several positive meanings (e.g., confer a trust upon, as in “I commit my soul to God”, or cause to be admitted, when speaking of a person to an institution, as in “he was committed to prison”). In a similar way, the word “patch”, in addition to neutral meanings, has negative meanings (e.g., sewing that repairs a worn or torn hole, or a piece of soft material that covers and protects an injured part of the body). Hence, it should come as no surprise that some sentiment analysis tools identify this comment as positive, others as negative and, finally, some as neutral.
3.4 A Follow-up Study
Tools | n | κ | ARI | F (neu) | F (pos) | F (neg) |
---|---|---|---|---|---|---|
NLTK, SentiStrength | 138 | 0.65 | 0.51 | 0.89 | 0.78 | 0.56 |
NLTK, Alchemy | 134 | 0.46 | 0.24 | 0.73 | 0.69 | 0.47 |
NLTK, Stanford NLP | 122 | 0.43 | 0.23 | 0.71 | 0.74 | 0.40 |
SentiStrength, Alchemy | 133 | 0.50 | 0.27 | 0.76 | 0.71 | 0.43 |
SentiStrength, Stanford NLP | 109 | 0.53 | 0.34 | 0.78 | 0.83 | 0.39 |
Alchemy, Stanford NLP | 130 | 0.36 | 0.19 | 0.49 | 0.79 | 0.31 |
NLTK, SentiStrength, Alchemy | 88 | 0.68 | 0.49 | 0.84 | 0.84 | 0.58 |
NLTK, SentiStrength, Stanford NLP | 71 | 0.72 | 0.52 | 0.85 | 0.91 | 0.55 |
SentiStrength, Alchemy, Stanford NLP | 74 | 0.59 | 0.38 | 0.73 | 0.91 | 0.41 |
NLTK, Alchemy, Stanford NLP | 75 | 0.55 | 0.28 | 0.68 | 0.83 | 0.52 |
NLTK, SentiStrength, Alchemy, Stanford NLP | 53 | 0.72 | 0.50 | 0.80 | 0.93 | 0.57 |
3.5 Threats to Validity
- Internal validity of our evaluation might have been affected by the exact ways the tools have been applied and by the interpretation of the tools’ output as an indication of sentiment, e.g., the calculation of a document-level sentiment as \(-2\cdot\#0 - \#1 + \#3 + 2\cdot\#4\) for Stanford NLP. Another threat to internal validity stems from the choice of the evaluation metrics: to reduce this threat we report several agreement metrics (ARI, weighted κ and F-measures) recommended in the literature.
- External validity of this study can be threatened by the fact that only one dataset has been considered and by the way this dataset has been constructed and evaluated by Murgia et al. (2014). To encourage replication of our study and evaluation of its external validity we make publicly available both the source code and the data used to obtain the results of this paper.
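The document-level aggregation mentioned for Stanford NLP can be sketched as follows. This is an illustration under stated assumptions: #k is read as the number of sentences that Stanford NLP assigns to class k (0 = very negative, 2 = neutral, 4 = very positive), and the final mapping of the score's sign to a label is a hypothetical choice that the text does not spell out. The function names are ours.

```python
from collections import Counter


def document_score(sentence_classes):
    """Aggregate sentence-level classes (integers 0..4) into one score:
    -2*#0 - #1 + #3 + 2*#4, where #k counts sentences of class k."""
    if any(c not in (0, 1, 2, 3, 4) for c in sentence_classes):
        raise ValueError("sentence classes must be integers from 0 to 4")
    counts = Counter(sentence_classes)
    return -2 * counts[0] - counts[1] + counts[3] + 2 * counts[4]


def document_sentiment(sentence_classes):
    """Assumed mapping: negative score -> negative, positive -> positive."""
    score = document_score(sentence_classes)
    if score < 0:
        return "negative"
    if score > 0:
        return "positive"
    return "neutral"
```

Under this reading, a comment with three negative sentences and one positive sentence, `[1, 1, 1, 3]`, has score −2 and is classified as negative, which illustrates how a single positive closing sentence can be outweighed.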
3.6 Summary
4 Impact of the Choice of Sentiment Analysis Tool
4.1 Methodology
4.1.1 Sentiment Analysis Tools
4.1.2 Datasets
 | Mean | Std Dev | Median |
---|---|---|---|
Android | 79.58 | 143.19 | 9 |
Gnome | 267.03 | 1.33 | 26.94 |
SO | 21.53 | 131.32 | 0.13 |
ASF | 96.57 | 255.44 | 4.16 |
resolved. In total 367,877 have been resolved. We analyze the sentiment of the short descriptions of the issues (short_desc) and calculate the time difference in seconds between the creation and closure of each issue. Recall that as opposed to the Android dataset, Gnome issues do not have titles.
gnome and having an accepted answer. For all 410 collected posts, we calculate the time difference in seconds between the creation of the post and the creation of the accepted answer. Before applying a sentiment analysis tool we remove HTML formatting from the titles and bodies of posts. In the results, we refer to the body of a post as its description.
4.1.3 Politeness Analysis
4.1.4 Statistical Analysis
4.1.5 Agreement Between the Results
- x is the number of pairs for which the tools agree about the relation between the response times (>> or <<),
- y is the number of pairs for which the tools agree about the lack of such a relation (∥∥),
- z is the number of pairs when one of the tools has established the relation and another one did not (∥>, ∥<, <∥ or >∥),
- w is the number of pairs when the tools have established different relations (<> or ><).
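The four counts defined above can be computed mechanically once each tool's findings are recorded per pair of groups. The sketch below assumes a hypothetical representation: each tool's result is a dictionary mapping a pair of sentiment groups to `">"`, `"<"`, or `"||"` (no relation established); none of these names come from the study's code.

```python
NO_RELATION = "||"  # hypothetical marker: no significant difference found


def agreement_counts(relations_a, relations_b):
    """Return (x, y, z, w) for two tools' pairwise relations.

    x: both tools found the same relation
    y: neither tool found a relation
    z: exactly one tool found a relation
    w: the tools found opposite relations
    """
    if relations_a.keys() != relations_b.keys():
        raise ValueError("both tools must report on the same group pairs")
    x = y = z = w = 0
    for pair, rel_a in relations_a.items():
        rel_b = relations_b[pair]
        if rel_a == NO_RELATION and rel_b == NO_RELATION:
            y += 1
        elif rel_a == NO_RELATION or rel_b == NO_RELATION:
            z += 1
        elif rel_a == rel_b:
            x += 1
        else:
            w += 1
    return x, y, z, w


# Illustrative input: the tools agree on two relations and contradict on one.
tool_a = {("neg", "neu"): ">", ("pos", "neu"): ">", ("pos", "neg"): ">"}
tool_b = {("neg", "neu"): ">", ("pos", "neu"): ">", ("pos", "neg"): "<"}
print("-".join(str(c) for c in agreement_counts(tool_a, tool_b)))
```

For this illustrative input the script prints `2-0-0-1`, i.e., the x−y−z−w format; a non-zero w indicates that the two tools lead to contradictory conclusions.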
4.2 Results
NLTK | SentiStrength | NLTK \(\cap\) SentiStrength | |
---|---|---|---|
neg-neu-pos | neg-neu-pos | neg-neu-pos | |
Android | |||
title | 1,230-3,588-398 | 1,417-3,415-384 | 396-2,381-36 |
∅ | ∅ | ∅ | |
descr | 2,690-1,657-869 | 1,684-2,435-1,182\(^{a}\) | 893-712-299 |
neu > neg\(^{***}_{2.79\times 10^{-8}}\) | neu > neg\(^{*}_{2.54\times 10^{-2}}\) | ||
neu > pos\(^{**}_{5.55\times 10^{-3}}\) | neu > pos\(^{**}_{9.72\times 10^{-3}}\) | neu > pos\(^{***}_{7.53\times 10^{-5}}\) | |
neg > pos\(^{***}_{6.32\times 10^{-4}}\) | neg > pos\(^{*}_{3.81\times 10^{-2}}\) | ||
Gnome | |||
descr | 54,032-291,906-20,380 | 58,585-293,226-14,507 | 16,829-242,780-1,785 |
neg > neu\(^{***}_{0}\) | neg > neu\(^{***}_{0}\) | neg > neu\(^{***}_{0}\) | |
pos > neu\(^{***}_{0}\) | pos > neu\(^{***}_{0}\) | pos > neu\(^{***}_{0}\) | |
pos > neg\(^{***}_{0}\) | |||
neg > pos\(^{***}_{0}\) | |||
Stack Overflow | |||
title | 84-285-41 | 53-330-27 | 16-240-8 |
∅ | ∅ | ∅ | |
descr | 249-71-90 | 90-183-137 | 62-35-42 |
∅ | neg > pos\(^{*}_{3.46\times 10^{-2}}\) | ∅ | |
ASF | |||
title | 19,367-67,948-8,348\(^{b}\) | 24,141-62,016-9,510 | 6,450-44,818-1,106 |
pos > neu\(^{***}_{0}\) | pos > neu\(^{**}_{3.71\times 10^{-3}}\) | ||
pos > neg\(^{***}_{0}\) | pos > neg\(^{***}_{2.60\times 10^{-12}}\) | ||
descr\(^{c}\) | 30,339-42,540-13,129\(^{d}\) | 29,021-41,043-15,971\(^{e}\) | 10,989-20,940-3,814 |
neg > neu\(^{***}_{0}\) | neg > neu\(^{***}_{0}\) | neg > neu\(^{***}_{0}\) | |
pos > neu\(^{***}_{0}\) | pos > neu\(^{***}_{0}\) | pos > neu\(^{***}_{0}\) | |
pos > neg\(^{***}_{5.32\times 10^{-13}}\) | pos > neg\(^{***}_{5.12\times 10^{-13}}\) | ||
 | NLTK vs. SentiStrength | NLTK vs. NLTK \(\cap\) SentiStrength | SentiStrength vs. NLTK \(\cap\) SentiStrength |
---|---|---|---|
Android | |||
title | 0−3−0−0 | 0−3−0−0 | 0−3−0−0 |
descr | 1−0−2−0 | 2−0−1−0 | 2−0−1−0 |
Gnome | |||
desc | 2−0−0−1 | 2−0−1−0 | 2−0−1−0 |
Stack Overflow | |||
title | 0−3−0−0 | 0−3−0−0 | 0−3−0−0 |
desc | 0−2−1−0 | 0−3−0−0 | 0−2−1−0 |
ASF | |||
title | 1−1−1−0 | 0−1−2−0 | 1−1−1−0 |
desc | 2−0−1−0 | 2−0−1−0 | 3−0−0−0 |
NLTK | SentiStrength | NLTK \(\cap\) SentiStrength | |||||||
---|---|---|---|---|---|---|---|---|---|
title | |||||||||
neg | neu | pos | neg | neu | pos | neg | neu | pos | |
imp | 948 | 2872 | 268 | 1077 | 2729 | 279 | 297 | 1935 | 18 |
neu | 245 | 693 | 120 | 315 | 652 | 89 | 86 | 432 | 17 |
pol | 37 | 23 | 10 | 22 | 32 | 16 | 13 | 14 | 1 |
∅ | ∅ | —\(^{a}\) | |||||||
descr | |||||||||
neg | neu | pos | neg | neu | pos | neg | neu | pos | |
imp | 262 | 220 | 41 | 218 | 236 | 68 | 118 | 110 | 7 |
neu | 562 | 530 | 144 | 470 | 515 | 251 | 211 | 229 | 46 |
pol | 1866 | 907 | 684 | 996 | 1594 | 863 | 564 | 373 | 246 |
neg.neu > pos.pol\(^{**}_{1.40\times 10^{-3}}\) | |||||||||
neg.pol > pos.pol\(^{*}_{4.55\times 10^{-2}}\) | |||||||||
neu.imp > neg.pol\(^{*}_{4.63\times 10^{-2}}\) | |||||||||
neu.imp > pos.pol\(^{**}_{7.20\times 10^{-3}}\) | |||||||||
neu.neu > neg.imp\(^{*}_{4.23\times 10^{-2}}\) | |||||||||
neu.neu > neg.pol\(^{***}_{1.19\times 10^{-5}}\) | neu.neu > neg.pol\(^{*}_{3.89\times 10^{-2}}\) | ||||||||
neu.neu > pos.pol\(^{*}_{3.91\times 10^{-2}}\) | neu.neu > pos.pol\(^{**}_{3.14\times 10^{-3}}\) | ||||||||
neu.pol > neg.pol\(^{***}_{8.19\times 10^{-4}}\) | |||||||||
 | NLTK vs. SentiStrength | NLTK vs. NLTK \(\cap\) SentiStrength | SentiStrength vs. NLTK \(\cap\) SentiStrength |
---|---|---|---|
Android | |||
title | 0−36−0−0 | —\(^{a}\) | —\(^{a}\) |
descr | 0−30−6−0 | 1−30−5−0 | 1−30−5−0 |
Gnome | |||
desc | 14−13−7−2 | 10−15−11−0 | 10−18−8−0 |
Stack Overflow | |||
title | 0−28−0−0\(^{b}\) | —\(^{c}\) | —\(^{c}\) |
desc | 0−33−3−0 | —\(^{c}\) | —\(^{c}\) |
ASF | |||
title | 1−24−10−1 | 0−31−5−0 | 0−27−9−0 |
desc | 25−3−7−1 | 23−5−8−0 | 23−4−9−0 |
4.3 Discussion
4.4 Threats to Validity
5 Implications on Earlier Studies
5.1 Replicated Studies
5.2 Replication Approach
5.2.1 Pletea et al.
5.2.2 Guzman et al.
5.3 Replication Results
5.3.1 Pletea et al.
Type | Comments | Discussions | |||
---|---|---|---|---|---|
Commits | Pletea et al. (2014) | Security | 2689 (4.43 %) | 1809 (9.84%) | |
Total | 60658 | 18380 | |||
Current study | Before elimination | Security | 2509 (4.13 %) | 1706 (9.28 %) | |
Total | 60658 | 18377 | |||
Excluded SentiStrength | 9 | 32 | |||
Excluded NLTK | 0 | 1 | |||
For further analysis | Security | 2509 (4.14 %) | 1689 (9.21 %) | ||
Total | 60649 | 18344 | |||
Pletea et al. (2014) | Security | 2689 (4.43 %) | 1809 (9.84 %) | ||
Total | 60658 | 18380 | |||
Current study | Before elimination | Security | 1801 (3.28 %) | 1091 (11.36 %) | |
Total | 54892 | 9601 | |||
Excluded SentiStrength | 1 | 16 | |||
Excluded NLTK | 5 | 0 | |||
For further analysis | Security | 1800 (3.28 %) | 1081 (11.28 %) | ||
Total | 54886 | 9585 |
Type | | | Negative | Neutral | Positive |
---|---|---|---|---|---|
Discussions | Pletea et al. (2014), NLTK | Security | 72.52 % | 10.88 % | 16.58 % |
 | | Rest | 52.28 % | 20.37 % | 25.33 % |
 | Current study, NLTK | Security | 70.16 % | 12.79 % | 17.05 % |
 | | Rest | 52.89 % | 21.50 % | 25.61 % |
 | Current study, SentiStrength | Security | 30.66 % | 42.92 % | 26.40 % |
 | | Rest | 24.13 % | 43.92 % | 31.94 % |
Comments | Pletea et al. (2014), NLTK | Security | 55.59 % | 23.42 % | 20.97 % |
 | | Rest | 46.94 % | 26.58 % | 26.47 % |
 | Current study, NLTK | Security | 55.96 % | 22.88 % | 21.16 % |
 | | Rest | 46.89 % | 26.61 % | 26.50 % |
 | Current study, SentiStrength | Security | 32.60 % | 46.95 % | 20.44 % |
 | | Rest | 22.30 % | 50.74 % | 26.95 % |
Type | | | Negative | Neutral | Positive |
---|---|---|---|---|---|
Discussions | Pletea et al. (2014), NLTK | Security | 81.00 % | 5.52 % | 13.47 % |
 | | Rest | 69.58 % | 11.98 % | 18.42 % |
 | Current study, NLTK | Security | 77.61 % | 7.03 % | 15.36 % |
 | | Rest | 67.43 % | 13.82 % | 18.76 % |
 | Current study, SentiStrength | Security | 30.80 % | 45.51 % | 23.68 % |
 | | Rest | 24.15 % | 51.17 % | 24.67 % |
Comments | Pletea et al. (2014), NLTK | Security | 59.83 % | 19.09 % | 21.06 % |
 | | Rest | 50.16 % | 26.12 % | 23.70 % |
 | Current study, NLTK | Security | 59.67 % | 18.83 % | 21.50 % |
 | | Rest | 49.81 % | 26.45 % | 23.74 % |
 | Current study, SentiStrength | Security | 25.66 % | 51.22 % | 23.11 % |
 | | Rest | 18.14 % | 62.87 % | 18.97 % |
Sec. relevance | Discussion (Commit ID) | # sec. keywords | Sec. relevance (human) | NLTK neutral (%) | NLTK negative (%) | NLTK positive (%) | NLTK result | SentiStrength result | Sentiment (human) |
---|---|---|---|---|---|---|---|---|---|
High | 535033 | 6 | Yes | 16.5 | 42.9 | 57.0 | pos | neutral | neg(*) |
256855 | 4 | Yes | 17.1 | 84.2 | 15.7 | neg | neutral | neg(*) | |
455971 | 6 | Yes | 19.1 | 84.3 | 15.6 | neg | neutral | neutral | |
131473 | 5 | Yes | 21.4 | 45.8 | 54.2 | pos | neg | neg(*****) | |
253685 | 4 | No | 20.4 | 59.1 | 40.8 | neg | neutral | pos(*) | |
370765 | 5 | Yes | 20.0 | 65.0 | 34.9 | neg | neutral | pos(***) | |
59082 | 4 | No | 19.8 | 76.4 | 23.5 | neg | neutral | neg(*) | |
157981 | 11 | Yes | 23.9 | 58.8 | 41.1 | neg | neutral | neg(***) | |
391963 | 9 | Yes | 16.7 | 71.9 | 28.0 | neg | neutral | pos(****) | |
272987 | 4 | Yes | 22.4 | 41.6 | 58.3 | pos | pos | neg(*) | |
Medium | 15128 | 1 | No | 20.6 | 71.3 | 28.6 | neg | neutral | neutral |
396099 | 1 | No | 18.8 | 74.0 | 26.0 | neg | neg | neg(****) | |
132779 | 1 | No | 30.6 | 76.4 | 23.5 | neg | pos | neutral | |
295686 | 1 | No | 23.9 | 70.7 | 29.3 | neg | neutral | pos(*) | |
541007 | 1 | Partial | 37.7 | 71.7 | 28.2 | neg | neg | neg(*) | |
199287 | 1 | Partial | 18.9 | 76.4 | 23.5 | neg | neutral | neg(*) | |
461318 | 1 | Yes | 15.0 | 75.0 | 24.9 | neg | neutral | neg(*) | |
509384 | 1 | Partial | 33.4 | 67.3 | 32.7 | neg | neutral | neutral | |
338681 | 1 | No | 29.9 | 75.5 | 24.4 | neg | pos | neg(*) | |
511734 | 1 | No | 17.6 | 79.4 | 20.5 | neg | pos | pos(***) | |
Low | 364215 | 1 | No | 41.4 | 44.1 | 55.8 | pos | neg | neg(*) |
274571 | 1 | Partial | 30.1 | 46.5 | 53.4 | pos | pos | neg(**) | |
47639 | 1 | Yes | 19.3 | 38.6 | 61.3 | pos | neutral | pos(*****) | |
277765 | 1 | No | 27.0 | 45.2 | 54.7 | pos | pos | pos(*) | |
6491 | 1 | No | 37.6 | 29.6 | 70.4 | pos | neutral | neutral | |
130367 | 1 | No | 15.4 | 43.6 | 56.3 | pos | pos | pos(*) | |
189623 | 1 | No | 57.9 | 35.8 | 64.1 | neutral | neutral | pos(***) | |
41379 | 1 | Partial | 30.9 | 26.1 | 73.8 | pos | pos | pos(***) | |
456580 | 1 | No | 26.6 | 46.6 | 53.3 | pos | pos | pos(***) | |
52122 | 1 | No | 17.6 | 46.3 | 53.6 | pos | neutral | pos(*****) |
5.3.2 Guzman et al.
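Lang | Guzman et al. (2014) Com | Guzman et al. (2014) SentiStrength Mean | Guzman et al. (2014) SentiStrength SD | Current study Com | Current study SentiStrength Mean | Current study SentiStrength SD | Current study SentiStrength Med | Current study SentiStrength IQR | Current study NLTK Mean | Current study NLTK SD | Current study NLTK Med | Current study NLTK IQR |
---|---|---|---|---|---|---|---|---|---|---|---|---|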
C | 6257 | 0.023 | 1.716 | 6277 | −0.217 | 1.746 | 0.000 | 2.000 | −1.834 | 3.095 | −3.256 | 4.491 |
C++ | 16930 | 0.017 | 1.725 | 16983 | −0.031 | 1.765 | 0.000 | 4.000 | 1.017 | 2.959 | 0.000 | 5.953 |
Java | 4713 | −0.144 | 1.736 | 4712 | −0.282 | 1.887 | 0.000 | 4.000 | −1.753 | 3.106 | −3.191 | 4.460 |
Python | 2128 | −0.018 | 1.711 | 2133 | −0.182 | 1.709 | 0.000 | 2.000 | −1.636 | 3.079 | −3.093 | 4.395 |
Ruby | 15257 | 0.002 | 1.714 | 15355 | −0.034 | 1.794 | 0.000 | 4.000 | 1.243 | 3.117 | 0.000 | 6.293 |
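Day | Guzman et al. (2014) Com | Guzman et al. (2014) SentiStrength Mean | Guzman et al. (2014) SentiStrength SD | Current study Com | Current study SentiStrength Mean | Current study SentiStrength SD | Current study SentiStrength Med | Current study SentiStrength IQR | Current study NLTK Mean | Current study NLTK SD | Current study NLTK Med | Current study NLTK IQR |
---|---|---|---|---|---|---|---|---|---|---|---|---|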
Mon | 9517 | −0.043 | 1.732 | 9533 | −0.148 | 1.790 | 0.000 | 4.000 | −1.316 | 3.047 | 0.000 | 6.199 |
Tue | 9319 | 0.005 | 1.712 | 9389 | −0.089 | 1.766 | 0.000 | 4.000 | −1.344 | 3.079 | 0.000 | 6.218 |
Wed | 9730 | 0.008 | 1.716 | 9748 | −0.117 | 1.797 | 0.000 | 4.000 | −1.372 | 3.100 | 0.000 | 6.292 |
Thu | 9538 | 0.001 | 1.728 | 9561 | −0.116 | 1.791 | 0.000 | 4.000 | −1.357 | 3.073 | 0.000 | 6.226 |
Fri | 9076 | −0.016 | 1.739 | 9152 | −0.075 | 1.791 | 0.000 | 4.000 | −1.347 | 3.082 | 0.000 | 6.256 |
Sat | 6701 | −0.027 | 1.688 | 6722 | −0.073 | 1.788 | 0.000 | 4.000 | −1.326 | 3.066 | 0.000 | 6.264 |
Sun | 6544 | 0.022 | 1.717 | 6544 | −0.123 | 1.774 | 0.000 | 4.000 | −1.381 | 3.081 | 0.000 | 6.245 |
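Day | Guzman et al. (2014) Com | Guzman et al. (2014) SentiStrength Mean | Guzman et al. (2014) SentiStrength SD | Current study Com | Current study SentiStrength Mean | Current study SentiStrength SD | Current study SentiStrength Med | Current study SentiStrength IQR | Current study NLTK Mean | Current study NLTK SD | Current study NLTK Med | Current study NLTK IQR |
---|---|---|---|---|---|---|---|---|---|---|---|---|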
morning | 12714 | 0.001 | 1.730 | 12750 | −0.112 | 1.777 | 0.000 | 4.000 | −1.398 | 3.062 | 0.000 | 6.234 |
afternoon | 19809 | 0.004 | 1.717 | 19859 | −0.089 | 1.764 | 0.000 | 4.000 | −1.326 | 3.076 | 0.000 | 6.235 |
evening | 16584 | −0.023 | 1.721 | 16634 | −0.102 | 1.794 | 0.000 | 4.000 | −1.323 | 3.085 | 0.000 | 6.261 |
night | 11318 | −0.016 | 1.713 | 11415 | −0.142 | 1.820 | 0.000 | 4.000 | −1.370 | 3.077 | 0.000 | 6.246 |