Introduction
Motivation and contributions
- We explore a Bi-LSTM + Attention model for readers’ emotion detection, framed as multi-target regression over short-text news documents, and compare its performance against a set of baselines drawn from various families of textual emotion detection techniques, including lexicon based, machine learning, and deep learning approaches, using an extensive set of coarse-grained and fine-grained evaluation measures.
- We investigate the interpretability of the attention mechanism to understand the underlying behavior of the Bi-LSTM + Attention model for readers’ emotion detection, conducting qualitative and quantitative analyses to quantify the role of emotion words and named entities in the model’s decision making.
- We curate two new readers’ emotion news datasets, REN-10k and RENh-4k, in which news articles are associated with their corresponding readers’ emotions. We also assign genre information to the articles. As a result, beyond readers’ emotion detection, these datasets can be used for multiple tasks, including document summarization and genre classification, at various scales (short-text and long-text), making them heterogeneous task datasets. We publicly contribute REN-10k at https://dcs.uoc.ac.in/cida/resources/ren-10k.html and RENh-4k at https://dcs.uoc.ac.in/cida/resources/renh-4k.html along with this publication to aid future research.
Related work
Lexicon based approaches
Machine learning based approaches
Deep learning based approaches
The question of interpretability
Multi-target readers’ emotion detection
Methodology
Empirical study
Dataset
Readers’ emotion news datasets
SemEval-2007
Dataset pre-processing
Statistics | REN-10k | RENh-4k | SemEval-2007 |
---|---|---|---|
Source | Rappler | Rappler | The New York Times, CNN, BBC, Google News |
Year span | 2014 to 2019 | 2015 to 2018 | – |
Length | Short-text (after pre-processing) | Short-text | Short-text |
Number of news documents | 10,272 | 4000 | 1246 (valid documents after pre-processing) |
Total number of words | 305,160 | 124,172 | 6364 |
Number of unique words | 27,749 | 13,260 | 3286 |
Average words per document | 29.70 | 31.043 | 5.09 |
Average sentences per document | 1.18 | 1.1875 | 1.00 |
Number of annotations | 528,327 | 242,680 | 6 (annotators) |
Mean fraction of votes: Anger | 0.2124 | 0.3388 | 0.1013 |
Mean fraction of votes: Fear | 0.0658 | 0.1475 | 0.1639 |
Mean fraction of votes: Joy | 0.4215 | 0.3137 | 0.2860 |
Mean fraction of votes: Sadness | 0.1399 | 0.0781 | 0.2069 |
Mean fraction of votes: Surprise | 0.1606 | 0.1218 | 0.2416 |
Articles per emotion class: Anger | 6904 | 3068 | 652 |
Articles per emotion class: Fear | 4233 | 1850 | 820 |
Articles per emotion class: Joy | 8917 | 3267 | 786 |
Articles per emotion class: Sadness | 5972 | 2489 | 863 |
Articles per emotion class: Surprise | 6431 | 2312 | 1102 |
Experimental setup and evaluations
Model performance evaluation
Deep learning baselines
- sent2affect [48]: This is a textual emotion detection method that applies transfer learning from an RNN model initially trained for sentiment analysis. To reproduce their work faithfully, we use the sentiment140 dataset to build the model, as the Twitter Sentiment dataset used in their paper was not found at the relevant link provided. We believe sentiment140 is an appropriate substitute, primarily due to its large size of 1.6 million data objects.
- Kim’s CNN [64]: This is a popular CNN architecture for text classification. The hyper-parameters used to build this model are given in the Appendix.
Lexicon based baselines
- Emotion Term Model [12]: This is an improved version of the classical Naïve Bayes that incorporates emotion rating information along with the term independence assumption.
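The scoring idea behind such a model can be sketched as a Naïve Bayes-style product over terms, where each term’s contribution is weighted by its emotion rating. The rating table, prior, and smoothing constant below are illustrative placeholders, not the estimates from [12]:

```python
import math

def etm_label(doc_words, emotions, term_rating, prior, eps=1e-3):
    """Pick the emotion maximizing a Naive Bayes-style log score."""
    scores = {}
    for e in emotions:
        log_p = math.log(prior[e])  # emotion class prior
        for w in doc_words:
            # term independence: multiply per-word emotion ratings,
            # smoothed with eps for unseen (word, emotion) pairs
            log_p += math.log(term_rating.get((w, e), eps))
        scores[e] = log_p
    return max(scores, key=scores.get)

# Hypothetical ratings: fraction of reader votes linking a word to an emotion
rating = {("wins", "joy"): 0.6, ("title", "joy"): 0.3, ("crash", "fear"): 0.7}
prior = {"joy": 0.5, "fear": 0.5}
best = etm_label(["wins", "title"], ["joy", "fear"], rating, prior)
print(best)  # joy
```

In a multi-target setting the normalized scores over all emotions, rather than the single argmax, would serve as the predicted vote distribution.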
- Synesketch [33]: This is a textual emotion detection system that makes use of a word-level lexicon and an emoticon lexicon, along with a set of heuristic rules.
Classical machine learning baselines
- WMD [39]: This is a textual emotion detection method that uses the Word Mover’s Distance feature with an SVM classifier. To reproduce this work faithfully, we use 60% of our corpus for training, 20% for testing, and the remaining 20% as the seed corpus, for the five emotion classes. We use Support Vector Regression (SVR) with a multi-output regressor for our multi-target regression problem instead of their SVM classifier.
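A minimal sketch of this multi-target adaptation, assuming the Word Mover’s Distance features have already been computed into a matrix `X` (the shapes and random data below are placeholders):

```python
# SVR wrapped in MultiOutputRegressor replaces the original SVM classifier,
# so one model per target jointly predicts all five emotion-vote proportions.
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.random((100, 5))   # e.g. WMD distances to five seed corpora
y_train = rng.random((100, 5))   # vote proportions for five emotion classes
y_train /= y_train.sum(axis=1, keepdims=True)

model = MultiOutputRegressor(SVR(kernel="rbf"))  # one SVR per emotion target
model.fit(X_train, y_train)
y_pred = model.predict(X_train[:2])
print(y_pred.shape)  # (2, 5): one score per emotion class
```

`MultiOutputRegressor` simply fits an independent SVR per emotion, which is the standard problem-transformation route for single-target estimators.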
- Multi-target regression with handcrafted features: We use multiple methods for multi-target regression, with a rich set of features. We describe the features and the models below:
- Sentiment Word Feature [32, 65]: A combination of two sets of sentiment-oriented features forming a single sentiment word feature. The first set captures the total number of positive, negative, and neutral words, and the second computes the average positive, negative, and neutral sentiment intensity for a document. We make use of VADER [66] to compute the sentiment features.
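The resulting six-dimensional feature can be sketched as follows; a tiny hypothetical dictionary stands in here for VADER’s per-word polarity scores:

```python
# Counts of positive/negative/neutral words plus their average intensities.
def sentiment_word_feature(words, polarity):
    scores = [polarity.get(w, 0.0) for w in words]  # per-word polarity
    pos = [s for s in scores if s > 0]
    neg = [s for s in scores if s < 0]
    neu = [s for s in scores if s == 0]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    # first set: counts; second set: average intensities
    return [len(pos), len(neg), len(neu), mean(pos), mean(neg), mean(neu)]

lexicon = {"good": 0.7, "bad": -0.6, "terrible": -0.9}  # hypothetical scores
feat = sentiment_word_feature("the bad news is not good".split(), lexicon)
print(feat)  # [1, 1, 4, 0.7, -0.6, 0.0]
```

In the actual pipeline each word would be scored with VADER rather than a hand-written lexicon, but the feature layout is the same.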
- Multi-target Regression Models: We now describe the multi-target regression models across various families of methods. Under the problem transformation approach, we implement a Multi-output Regressor with Ridge, SVR, and GradientBoostingRegressor as base estimators. Under the algorithm adaptation approach, we implement a Multi-Layer Perceptron with a single hidden layer of 128 neurons, ReLU activation, an l2(0.001) regularizer, and a final output layer with softmax activation. The other hyperparameters are an MSE loss function, the Adam optimizer with a learning rate of 0.0005, a batch size of 64, and 100 epochs.
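The forward pass of the algorithm-adaptation MLP can be sketched in plain NumPy; the input dimension is a placeholder, and training (MSE loss, Adam with lr 0.0005, batch size 64, 100 epochs) is assumed to happen in the deep learning framework of choice:

```python
# One hidden layer of 128 ReLU units and a 5-way softmax output, so a
# single network predicts the full emotion distribution jointly.
import numpy as np

def forward(x, W1, b1, W2, b2):
    h = np.maximum(0.0, x @ W1 + b1)           # ReLU hidden layer (128 units)
    z = h @ W2 + b2
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)   # softmax: 5 emotion scores

rng = np.random.default_rng(0)
d_in, d_h, d_out = 300, 128, 5                 # e.g. 300-d document vector
W1, b1 = rng.normal(size=(d_in, d_h)) * 0.01, np.zeros(d_h)
W2, b2 = rng.normal(size=(d_h, d_out)) * 0.01, np.zeros(d_out)

p = forward(rng.normal(size=(2, d_in)), W1, b1, W2, b2)
print(p.shape)  # (2, 5); each row sums to 1
```

Unlike the problem-transformation models, which fit one regressor per emotion, this network shares all hidden parameters across the five targets.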
Performance evaluation measures
Results and discussion
Model | Acc@1 (%)\(\uparrow\) | \(\hbox {AP}_{{\mathrm{document}}}\uparrow\) | \(\hbox {AP}_{{\mathrm{emotion}}}\uparrow\) | \(\hbox {RMSE}_{\mathrm{D}}\downarrow\) | \(\hbox {WD}_{\mathrm{D}}\downarrow\) |
---|---|---|---|---|---|
Bi-LSTM + Attention (Our Method) | 60.55 | 0.7994 | 0.5596 | 0.1500 | 0.0812 |
Deep learning baselines | |||||
sent2affect [48] | 49.39 | 0.5716 | 0.1004 | 0.2383 | 0.1298 |
SS-BED [44] | 55.11 | 0.7090 | 0.4944 | 0.2209 | 0.1202 |
Kim’s CNN [64] | 49.03 | 0.5893 | 0.1610 | 0.2332 | 0.1322 |
Bi-LSTM [48] | 52.80 | 0.6282 | 0.4804 | 0.2215 | 0.1202 |
LSTM [9] | 52.07 | 0.6064 | 0.4581 | 0.2223 | 0.1204 |
GRU | 50.17 | 0.6012 | 0.2013 | 0.2329 | 0.1293 |
Lexicon based baselines | |||||
SWAT [11] | 51.28 | 0.6151 | 0.3483 | 0.2551 | 0.1472 |
Emotion Term Model [12] | 53.57 | 0.6023 | 0.0115 | 0.3343 | 0.2520 |
Synesketch [33] | 35.86 | 0.1632 | 0.2326 | 0.2677 | 0.1664 |
Problem transformation baselines | |||||
WMD [39] | 43.56 | 0.2366 | 0.0981 | 0.3156 | 0.1480 |
– | 49.47 | 0.6019 | 0.3133 | 0.2347 | 0.1235 |
– | 48.85 | 0.5331 | 0.2512 | 0.2362 | 0.1251 |
TEC [32] | 50.90 | 0.6035 | 0.3133 | 0.2460 | 0.1297 |
TEI [32] | 50.90 | 0.6088 | 0.3147 | 0.2301 | 0.1243 |
MEI [32] | 50.85 | 0.6029 | 0.2379 | 0.2310 | 0.1255 |
GEC [32] (\(\delta = 0.25\)) | 50.67 | 0.6021 | 0.2765 | 0.2388 | 0.1238 |
GEI [32] (\(\delta = 0.25\)) | 50.63 | 0.6007 | 0.2731 | 0.2392 | 0.1232 |
– | 50.12 | 0.6050 | 0.1939 | 0.2323 | 0.1274 |
SSWEu [63] (\(d=50\)) | 49.48 | 0.5726 | 0.0714 | 0.2384 | 0.1280 |
GloVe [44] (\(d=100\)) | 49.63 | 0.5670 | 0.0716 | 0.2390 | 0.1279 |
Algorithm adaptation baselines | |||||
– | 50.17 | 0.6071 | 0.2555 | 0.2303 | 0.1268 |
– | 50.03 | 0.5829 | 0.2173 | 0.2354 | 0.1347 |
TEC [32] | 50.51 | 0.6625 | 0.3523 | 0.2257 | 0.1214 |
TEI [32] | 53.80 | 0.6516 | 0.3211 | 0.2252 | 0.1209 |
MEI [32] | 49.53 | 0.5713 | 0.1859 | 0.2380 | 0.1291 |
GEC [32] (\(\delta = 0.25\)) | 51.24 | 0.6423 | 0.2758 | 0.2285 | 0.1218 |
GEI [32] (\(\delta = 0.25\)) | 52.60 | 0.6163 | 0.2322 | 0.2221 | 0.1269 |
– | 50.36 | 0.6014 | 0.1839 | 0.2331 | 0.1254 |
SSWEu [63] (\(d=50\)) | 49.44 | 0.5173 | 0.0984 | 0.3751 | 0.1330 |
GloVe [44] (\(d=100\)) | 49.44 | 0.5169 | 0.0509 | 0.3758 | 0.1334 |
Model | Acc@1 (%)\(\uparrow\) | \(\hbox {AP}_{{\mathrm{document}}}\uparrow\) | \(\hbox {AP}_{{\mathrm{emotion}}}\uparrow\) | \(\hbox {RMSE}_{\mathrm{D}}\downarrow\) | \(\hbox {WD}_{\mathrm{D}}\downarrow\) |
---|---|---|---|---|---|
Bi-LSTM + Attention (Our Method) | 50.50 | 0.6499 | 0.4054 | 0.2301 | 0.1220 |
Deep learning baselines | |||||
sent2affect [48] | 36.00 | 0.4684 | 0.1047 | 0.2508 | 0.1458 |
SS-BED [44] | 45.62 | 0.5534 | 0.3609 | 0.2406 | 0.1424 |
Kim’s CNN [64] | 40.00 | 0.4775 | 0.2084 | 0.2493 | 0.1585 |
Bi-LSTM [48] | 45.00 | 0.6297 | 0.3415 | 0.2400 | 0.1465 |
LSTM [9] | 40.13 | 0.5927 | 0.3402 | 0.2559 | 0.1472 |
GRU | 38.75 | 0.4860 | 0.1765 | 0.2481 | 0.1443 |
Lexicon based baselines | |||||
SWAT [11] | 43.75 | 0.5858 | 0.3005 | 0.2561 | 0.1608 |
Emotion Term Model [12] | 44.10 | 0.5520 | 0.0102 | 0.3369 | 0.2000 |
Synesketch [33] | 31.37 | 0.1394 | 0.2423 | 0.2936 | 0.1792 |
Problem transformation baselines | |||||
WMD [39] | 35.25 | 0.3593 | 0.0289 | 0.2869 | 0.1346 |
– | 44.37 | 0.5007 | 0.3490 | 0.2440 | 0.1316 |
– | 42.37 | 0.5067 | 0.3009 | 0.2662 | 0.1328 |
TEC [32] | 41.12 | 0.5686 | 0.3237 | 0.2410 | 0.1357 |
TEI [32] | 44.06 | 0.5908 | 0.3532 | 0.2409 | 0.1316 |
MEI [32] | 40.75 | 0.5394 | 0.2574 | 0.2442 | 0.1411 |
GEC [32] (\(\delta =0.25\)) | 42.75 | 0.5676 | 0.3063 | 0.2410 | 0.1363 |
GEI [32] (\(\delta =0.25\)) | 41.75 | 0.5602 | 0.2963 | 0.2417 | 0.1365 |
– | 39.25 | 0.4883 | 0.1443 | 0.2492 | 0.1386 |
SSWEu [63] (\(d=50\)) | 41.50 | 0.4969 | 0.1804 | 0.2483 | 0.1367 |
GloVe [44] (\(d=100\)) | 40.75 | 0.5108 | 0.2072 | 0.2474 | 0.1327 |
Algorithm adaptation baselines | |||||
– | 39.62 | 0.4630 | 0.2870 | 0.2516 | 0.1489 |
– | 42.75 | 0.4926 | 0.2796 | 0.2456 | 0.1505 |
TEC [32] | 41.37 | 0.5701 | 0.3298 | 0.2496 | 0.1356 |
TEI [32] | 42.87 | 0.6029 | 0.3528 | 0.2473 | 0.1343 |
MEI [32] | 40.12 | 0.4856 | 0.2279 | 0.2488 | 0.1466 |
GEC [32] (\(\delta = 0.25\)) | 44.75 | 0.5726 | 0.3190 | 0.2406 | 0.1359 |
GEI [32] (\(\delta = 0.25\)) | 41.37 | 0.5532 | 0.2934 | 0.2419 | 0.1378 |
– | 39.62 | 0.4846 | 0.1343 | 0.2491 | 0.1425 |
– | 35.62 | 0.3080 | 0.0207 | 0.4246 | 0.1376 |
GloVe [44] (\(d=100\)) | 35.37 | 0.2382 | 0.0920 | 0.4373 | 0.1376 |
Model | Acc@1 (%)\(\uparrow\) | \(\hbox {AP}_{{\mathrm{document}}}\uparrow\) | \(\hbox {AP}_{{\mathrm{emotion}}}\uparrow\) | \(\hbox {RMSE}_{\mathrm{D}}\downarrow\) | \(\hbox {WD}_{\mathrm{D}}\downarrow\) |
---|---|---|---|---|---|
Bi-LSTM + Attention (Our Method) | 52.60 | 0.7140 | 0.5506 | 0.1700 | 0.0915 |
Deep learning baselines | |||||
sent2affect [48] | 37.20 | 0.3339 | 0.1075 | 0.2241 | 0.1428 |
SS-BED [44] | 50.40 | 0.6139 | 0.5098 | 0.1771 | 0.1090 |
Kim’s CNN [64] | 47.20 | 0.5437 | 0.4451 | 0.1987 | 0.1200 |
Bi-LSTM [48] | 49.89 | 0.6007 | 0.5059 | 0.1812 | 0.1074 |
LSTM [9] | 49.20 | 0.6015 | 0.5248 | 0.1842 | 0.1089 |
GRU | 46.00 | 0.5673 | 0.5003 | 0.2005 | 0.1098 |
Lexicon based baselines | |||||
SWAT [11] | 46.00 | 0.4945 | 0.3981 | 0.2453 | 0.1354 |
Emotion Term Model [12] | 49.40 | 0.5642 | 0.0167 | 0.3031 | 0.1975 |
Synesketch [33] | 35.86 | 0.3705 | 0.3570 | 0.2470 | 0.1510 |
Problem transformation baselines | |||||
WMD [39] | 40.50 | 0.1447 | 0.0459 | 0.2430 | 0.1143 |
– | 45.60 | 0.4954 | 0.4039 | 0.2080 | 0.1135 |
– | 45.00 | 0.4992 | 0.3931 | 0.2089 | 0.1189 |
TEC [32] | 45.20 | 0.5451 | 0.4219 | 0.2028 | 0.1219 |
TEI [32] | 45.60 | 0.5900 | 0.4635 | 0.2985 | 0.1228 |
MEI [32] | 45.60 | 0.4884 | 0.4071 | 0.2051 | 0.1257 |
GEC [32] (\(\delta =0.25\)) | 40.80 | 0.4643 | 0.3398 | 0.2113 | 0.1251 |
GEI [32] (\(\delta =0.25\)) | 44.00 | 0.4416 | 0.3207 | 0.2136 | 0.1291 |
– | 39.04 | 0.5604 | 0.3820 | 0.2089 | 0.1208 |
SSWEu [63] (\(d=50\)) | 34.56 | 0.3130 | 0.1152 | 0.2300 | 0.1272 |
GloVe [44] (\(d=100\)) | 33.12 | 0.2605 | 0.1088 | 0.2378 | 0.1152 |
Algorithm adaptation baselines | |||||
– | 46.40 | 0.4799 | 0.3941 | 0.2059 | 0.1206 |
– | 46.80 | 0.5135 | 0.4140 | 0.2027 | 0.1171 |
TEC [32] | 46.40 | 0.5639 | 0.4270 | 0.2021 | 0.1204 |
TEI [32] | 49.60 | 0.6034 | 0.4993 | 0.2005 | 0.1122 |
MEI [32] | 46.40 | 0.4949 | 0.4103 | 0.2062 | 0.1306 |
GEC [32] (\(\delta = 0.25\)) | 46.00 | 0.4861 | 0.3622 | 0.2089 | 0.1229 |
GEI [32] (\(\delta = 0.25\)) | 46.70 | 0.4722 | 0.3531 | 0.2099 | 0.1248 |
– | 40.00 | 0.5732 | 0.3798 | 0.2023 | 0.1193 |
– | 40.80 | 0.2071 | 0.0595 | 0.4032 | 0.1641 |
GloVe [44] (\(d=100\)) | 42.40 | 0.2261 | 0.0777 | 0.4022 | 0.1643 |
Model behavior analysis
Ablation study over attention layer—uniform attention as the adversary
Approach | Acc@1 (%)\(\uparrow\) | \(\hbox {AP}_{{\mathrm{document}}}\uparrow\) | \(\hbox {AP}_{{\mathrm{emotion}}}\uparrow\) | \(\hbox {RMSE}_{{\mathrm{D}}}\downarrow\) | \(\hbox {WD}_{\mathrm{D}}\downarrow\) |
---|---|---|---|---|---|
REN-10k | |||||
Uniform Attention | 54.36 | 0.6963 | 0.4019 | 0.2200 | 0.1125 |
Bi-LSTM + Attention (Baseline-Our Method) | 60.55 | 0.7994 | 0.5596 | 0.1500 | 0.0812 |
RENh-4k | |||||
Uniform Attention | 46.87 | 0.6156 | 0.3515 | 0.2435 | 0.1357 |
Bi-LSTM + Attention (Baseline-Our Method) | 50.50 | 0.6499 | 0.4054 | 0.2301 | 0.1220 |
SemEval-2007 | |||||
Uniform Attention | 46.98 | 0.6490 | 0.5255 | 0.2050 | 0.1105 |
Bi-LSTM + Attention (Baseline-Our Method) | 52.60 | 0.7140 | 0.5506 | 0.1700 | 0.0915 |
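The uniform-attention adversary in the ablation above can be sketched as follows; the shapes are illustrative, with the hidden states coming from the Bi-LSTM in the actual model:

```python
# Replacing the learned attention distribution over T timesteps with a
# constant 1/T weighting turns the context vector into a plain average
# of the hidden states (mean pooling).
import numpy as np

def attention_context(H, alpha):
    """H: (T, d) hidden states; alpha: (T,) weights summing to 1."""
    return alpha @ H                        # weighted sum -> (d,) context

T, d = 30, 64
H = np.random.default_rng(0).normal(size=(T, d))
uniform = np.full(T, 1.0 / T)               # the adversary: equal weights
ctx = attention_context(H, uniform)
assert np.allclose(ctx, H.mean(axis=0))     # uniform attention == mean pooling
```

The performance gap between this adversary and the learned attention thus measures how much the model gains from non-uniform weighting of words.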
Qualitative evaluation
Quantitative evaluation
Dataset | DepecheMood++ | EmoWordNet | NRC-Affect Intensity Lexicon |
---|---|---|---|
REN-10k | 80.26 | 53.03 | 9.66 |
RENh-4k | 88.11 | 67.13 | 13.67 |
SemEval-2007 | 94.69 | 86.50 | 20.28 |
Dataset | DepecheMood++ | EmoWordNet |
---|---|---|
Behavioural similarity scores | ||
REN-10k | 0.8829 | 0.8497 |
RENh-4k | 0.7096 | 0.6988 |
SemEval-2007 | 0.8092 | 0.8040 |
Word similarity scores | ||
REN-10k | 0.8296 | 0.8010 |
RENh-4k | 0.6851 | 0.6606 |
SemEval-2007 | 0.8203 | 0.7919 |
Word probability scores | ||
REN-10k | 0.9043 | 0.8901 |
RENh-4k | 0.7648 | 0.7205 |
SemEval-2007 | 0.8981 | 0.8624 |