
Open Access 01-12-2021 | Research

Deep learning for emotion analysis in Arabic tweets

Authors: Enas A. Hakim Khalil, Enas M. F. El Houby, Hoda Korashy Mohamed

Published in: Journal of Big Data | Issue 1/2021


Abstract

Expressing feelings through social media has become an essential part of daily life; besides sharing ideas and thoughts, people share moments and memories. Social media platforms such as Facebook, Twitter, Weibo, and LinkedIn are rich sources of opinionated text data. Both organizations and individuals are interested in using social media to analyze people's opinions and extract sentiments and emotions. Most studies on social media analysis classify sentiment into positive, negative, or neutral classes. Emotion analysis is more challenging because humans can express one or several emotions within a single expression; human beings recognize these different emotions well, but they remain difficult for an automatic emotion analysis system. In most cases, the Arabic used on social media is slangy or colloquial, which makes it harder to preprocess and to filter noise, since most lemmatization and stemming tools are built for Modern Standard Arabic (MSA). An emotion analysis model has been implemented to categorize emotions as a multiclass, multilabel classification problem. Few studies have addressed this emotion classification problem for Arabic social media; nearly the only prior work is SemEval-2018 Task 1, sub-task E-c, in which several machine learning approaches were implemented and only a few were based on deep learning. Our model implements a novel multilayer bidirectional long short-term memory (BiLSTM) network trained on top of pre-trained word embedding vectors. This approach has been compared with other models developed for the same task using Support Vector Machines (SVM), random forest (RF), and fully connected neural networks; the proposed model achieved a performance improvement over the best results previously obtained for this task.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

With the rapid growth of web applications, such as e-commerce platforms, and the huge volume of social media comments in various fields, there is an urgent need to handle this massive amount of web data and automatically extract useful information from it. Sentiment analysis models play a significant role in this task. Sentiment analysis is a computational field within natural language processing (NLP) concerned with people's sentiments and opinions toward objects such as services, persons, products, events, organizations, and topics. The availability of high-performance computing makes it possible to apply different machine learning techniques, especially deep learning, to build robust automatic sentiment analysis models. Sentiment analysis detects positive, neutral, or negative opinions from text; emotion analysis is one of its most common tasks, recognizing different feelings expressed through text.
Emotions are mainly expressed through language; however, some emotions, such as joy, fear, and sadness, are more fundamental than others and can be expressed in different ways. For example, the tone of our utterances can reveal that we are very sad and slightly angry. The term affect refers to various categories of emotional phenomena such as joy, fear, arousal, and valence. Emotion detection is significant in many fields, such as public health, marketing, disaster management, public policy, and political issues [1]. There are two common representation schemes for emotions: the Ekman model [2] and the Plutchik model [3, 4]. The Ekman representation includes anger, happiness, disgust, surprise, fear, and sadness, while the Plutchik model adds two further labels, trust and anticipation, to Ekman's six emotions. Emotion recognition systems based on facial expressions and images have been widely used [5-8], hand gesture recognition has been applied to human action recognition (HAR) [9], and emotion recognition models have also been explored for human-computer interaction [10-12]. Emotion-rich textual data from social networks can be processed for various real-world applications [13-17].
Research interest in Arabic sentiment analysis has increased drastically due to the large number of Arabic-speaking Internet users. However, emotion recognition from Arabic text still requires considerable effort to develop more accurate emotion mining models for MSA and dialectal Arabic using large-scale emotion lexicons.
The fast rise of social media platforms (e.g., Twitter and Facebook), which enable users to communicate their sentiments, emotions, and ideas via text, has drawn researchers' attention to affect detection from text. SemEval, the international workshop on semantic evaluation that developed from SensEval (Evaluation Exercises for the Semantic Analysis of Text, organized by ACL-SIGLEX), introduced the influential task "SemEval-2018 Task 1: Affect in Tweets" [1] at its 12th workshop in 2018. This task consists of five subtasks in which automatic systems infer a person's emotional state from their tweets, each offered in English, Arabic, and Spanish. This study performs multilabel emotion classification for the Arabic language using the SemEval-2018 Task 1 datasets. Several preprocessing steps are implemented, including removing non-Arabic words, digits, and stop words, and applying a robust stemmer, the Arabic light stemmer (ARLSTEM). The tweets are then represented as feature vectors using AraVec (a set of Arabic word embedding models for use in Arabic NLP) [18]. These embedded vectors are fed to a multilayer bidirectional long short-term memory (BiLSTM) network. The contributions of this study can be summarized as follows:
  • We proposed an optimized BiLSTM network for multilabel Arabic emotion analysis. Building a multilabel, multiclass classification model for emotion analysis is still an open issue that needs further work, especially in Arabic.
  • Different preprocessing procedures that suit the nature of social media colloquial text have been implemented.
  • We employed a CBOW word embedding model for word representation; word embedding has been shown to give better results than other word representations.
  • We used a deep neural network of multiple BiLSTM layers, exploiting their ability to extract context information from Arabic text, and investigated the effect of changing the number of BiLSTM layers on performance.
  • Additionally, we investigated the effect of hyperparameter tuning and of different optimizers on the model's performance.
  • The model outperformed other machine learning models built for the same task.
The remainder of the paper is organized as follows: "Related work" section discusses related studies, including several Arabic text emotion analysis methods. "Methodology" section describes the proposed approach for evaluating the emotional content in tweets. "Experimental setup and results" section summarizes our findings and examines the most significant results. "Discussion" section discusses the results. Finally, "Conclusions and future work" section presents the conclusion and future studies.
Related work

Learning users' emotions is essential in many applications. Emotions can be inferred through text expressions as well as through HAR signals such as facial expressions, hand gestures, body posture, and voice. In [8], social robots were used as communication assistants for education and entertainment. The study described an acceptable and natural interaction between social robots and children: the thermal facial reaction of youngsters, i.e., the nose tip temperature signal, was recorded and classified in real time using the Mio Amico Robot during an experimental session. The categorization was performed by comparing the emotional state classified from the thermal signal analysis with the emotional state recorded by FaceReader 7. An empathic robot in [6] recognized human emotions through facial expressions and automatically responded to these specific emotional states; it produced a state-of-the-art accuracy of 95.58%, implemented a convolutional neural network (CNN) and a bank of Gabor filters in different experiments for feature representation, and employed SVMs and multilayer perceptrons as classifiers. For customer feedback detection in [12], a multimodal affect recognition system was developed to classify whether a customer likes or dislikes a product examined at a counter by analyzing the consumer's facial expression, hand gestures, body posture, and voice after testing the product. Hand gesture recognition is widely used in scientific research and is crucial for interacting with deaf individuals. Ozcan and Basturk [9] proposed a transfer-learning approach using AlexNet with hyperparameter tuning based on the ABC, GA, and PSO algorithms; the methodology produced effective outcomes with an average accuracy of 98.09%, outperforming the best previous work. In [19], computational analysis techniques measured the emotional facial expression of people with Parkinson's disease (PD). Examining masked face detection in PD is important because PD patients experience hypomimia, which reduces facial expression. This experiment achieved an accuracy of 85% on the testing images using a deep learning-based model.
Multilabel emotion classification is a hot topic in emotion analysis tasks since it represents real-life situations where the human may express a mixture of emotions in their text simultaneously. For example, the text may express happiness, love, optimism, sadness, or pessimism. Thus, building models with more than one output emotion for each input text is more beneficial. The following few paragraphs present some recent efforts in Arabic multilabel emotion classification.
Emotion mining in Arabic (EMA) [16] performed emotion and sentiment mining on Arabic tweets. First, preprocessing is performed by applying the normalization rules adopted in [20], including removal of diacritics (tashkeel), hamza, elongations, and non-Arabic letters. The most frequent emojis are then replaced with corresponding Arabic words using a manually created lexicon, and ARLSTEM [21] is used for stemming. In the feature selection stage, the authors tried different features separately; the AraVec word embedding proved to be the best feature. Each tweet is finally classified either as neutral or as one or more of the 11 emotions (anger, disgust, anticipation, joy, love, optimism, fear, pessimism, sadness, trust, and surprise). A linear support vector classifier (SVC) achieved the best performance among all classifiers tested, with a test accuracy of 0.489.
TW-Star [22] used different preprocessing steps: stemming (Stem), lemmatization (Lem), stop word removal (Stop), and common emoji recognition (Emo). The preprocessed tweets are classified by a binary relevance (BR) multilabel classifier using SVM with term frequency-inverse document frequency (TF-IDF) features. Among several experiments with different preprocessing combinations, the best accuracy of 0.465 was achieved by combining Emo, Stem, and Stop (Emo + Stem + Stop).
TeamUNCC [23] performed tokenization, white space removal, and treatment of punctuation marks as individual words. In the second stage, AraVec word2vec embeddings [18] were combined with features from the AffectiveTweets Weka package. Finally, classification was implemented with a fully connected NN with three dense hidden layers and a stochastic gradient descent (SGD) optimizer. The model achieved an accuracy of 0.446, exceeding the baseline model accuracy.
In [24], feature vectors were developed using the Doc2Vec model, and the random forest (RF) algorithm was used for classification. The Doc2Vec vector size was varied from 10 to 1000 in increments of 10, the number of decision trees in the forest from 10 to 150 in increments of 10, and the maximum tree depth from 2 to 20 in increments of 1. This model achieved an accuracy of 0.25.
TeamCEN [25] used the GloVe vector representation [26], which relies on word-word co-occurrence statistics, to represent words as vectors. Each tweet is then represented by the aggregated sum of the GloVe vectors of its words followed by dimensionality reduction, and classification is done using RF and SVM.
We therefore developed a novel BiLSTM deep learning model for multilabel emotion classification in Arabic tweets using the same dataset as the models described above. Its performance surpassed these models owing to the ability of LSTM networks to handle sequential data such as text better than other machine learning and deep learning models.

Methodology

This section describes the approach we followed to develop a framework for predicting users’ emotions from their tweets; the framework is shown in Fig. 1. The framework includes the following pipeline:
1.
Data preprocessing:
First, the tweet dataset has been preprocessed. The following steps were performed:
  • Initial preprocessing:
    Text normalization has been applied, including removing elongation (Tatweel) in Arabic words, removing repeated characters, and removing digits [the Arabic examples appear as inline images in the original]. Additionally, special characters are trimmed, English characters (a-z, A-Z) and French characters (àéèæoeç) are removed, and emoticons are replaced with corresponding words [inline image in the original].
  • Stop word removal: In natural language, stop words are words that add little or no meaning. They occur frequently in text, so they contribute no valuable information for classification or clustering and are usually removed before training the model. Our Arabic stop word list is an updated version of the NLTK Arabic stop word list (see the footnote). The updated list accounts for the changes in stop words that result from removing Hamza and Yaa [Arabic examples appear as inline images in the original]. Ambiguous words arising from this normalization are excluded from the updated stop word list so as not to increase ambiguity.
  • Creating an emoji lexicon:
    A lexicon of the most commonly shared tweet emojis is manually created, where each emoji is transcribed to its corresponding Arabic word. Emojis are then replaced with their emotion-related meanings.
  • Stemming:
    A further normalization step is stemming, which reduces a word to its standard form. Here, stemming is performed using ARLSTEM [21]. With social media data (tweets), using a word's stem is more valuable than its lemma because tweets are mostly dialectal Arabic rather than MSA, and most Arabic morphological analyzers are trained on MSA [17]. ARLSTEM normalizes the word by removing diacritics, which may introduce ambiguity in word semantics but facilitates the stemming process. It replaces hamzated Alif with Alif and Alif Maqsura with Yaa, removes Waaw at the beginning of a word, and trims prefixes from the beginning and suffixes from the end of the word. Stemming also includes transforming words from the feminine to the masculine form and stemming verb prefixes, suffixes, or both. In the experiments below, the proposed model has been examined with two variations of stemming: using the stemmer in its basic form, or excluding from stemming some special words frequently repeated in tweets [the Arabic word and its derivations appear as inline images in the original] and replacing all their derivations with one standard form. This minor modification has a good impact on performance, since occurrences of this word account for about 16% of the words in our dataset, and normalizing rather than stemming these words reduces ambiguity. A code sketch of these preprocessing steps follows.
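To make these steps concrete, here is a minimal sketch of the preprocessing pipeline described above. The EMOJI_LEXICON dictionary is a hypothetical stand-in for the manually created lexicon (which is not published with the paper), and ARLSTem is the NLTK implementation of the ARLSTEM stemmer [21]:

```python
import re
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem.arlstem import ARLSTem      # NLTK implementation of ARLSTEM [21]

# Hypothetical stand-in for the paper's manually built emoji lexicon.
EMOJI_LEXICON = {"\U0001F602": "ضحك", "\U0001F622": "حزن"}

stemmer = ARLSTem()
arabic_stopwords = set(stopwords.words("arabic"))  # updated manually in the paper

def preprocess(tweet: str) -> list:
    # Replace known emojis with their Arabic meanings.
    for emoji, word in EMOJI_LEXICON.items():
        tweet = tweet.replace(emoji, " " + word + " ")
    tweet = re.sub("\u0640", "", tweet)             # remove Tatweel (elongation)
    tweet = re.sub(r"[a-zA-Zàéèæoeç]", "", tweet)   # remove English/French characters
    tweet = re.sub(r"\d+", "", tweet)               # remove digits
    tweet = re.sub(r"(.)\1{2,}", r"\1", tweet)      # collapse repeated characters
    tokens = [t for t in tweet.split() if t not in arabic_stopwords]
    return [stemmer.stem(t) for t in tokens]        # ARLSTEM stemming
```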
 
2.
Feature extraction:
Feature selection and extraction are two major techniques used to improve the performance of machine learning models by removing redundant and irrelevant features and reducing the dimensionality of the feature space [27]. Rostami et al. [28] presented recent feature selection approaches based on genetic algorithms, which proved efficient for many machine learning classifiers without high computational complexity. Rostami et al. [29] produced a machine learning classification model that increased classification accuracy and reduced computational complexity using constrained feature selection. In our sentiment analysis problem, word embedding proved to be a more effective feature extraction technique than n-grams and TF-IDF, so we employed the AraVec embedding model [18]. AraVec is a large-scale model (a vocabulary of about 205,000 words) covering different Arabic dialects and trained on the Twitter domain. Word embedding is decisive here because it overcomes the sparsity problem of n-gram models and captures semantics by providing similar representations for words that occur in the same contexts. The pre-trained model "Twitter-CBOW/tweets_cbow_300," loaded with the gensim library in Python, has been used. Each word is represented by a 300-dimensional real-valued vector, and the tweet embedding is computed as the average embedding of all words in the tweet. The average embedded vector of each tweet is then fed to the classifier and classified either as neutral or as one or more of the 11 emotions (disgust, anger, fear, pessimism, anticipation, joy, love, optimism, sadness, trust, and surprise), as sketched below.
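A minimal sketch of this embedding step, assuming the AraVec "tweets_cbow_300" model file has been downloaded locally (the exact file name may vary between AraVec releases):

```python
import numpy as np
from gensim.models import Word2Vec

# Load the pre-trained AraVec CBOW Twitter model (file name assumed).
model = Word2Vec.load("tweets_cbow_300")
EMB_DIM = 300

def tweet_vector(tokens: list) -> np.ndarray:
    # Average the 300-d embeddings of the in-vocabulary tokens.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    if not vecs:
        return np.zeros(EMB_DIM)  # fallback for fully out-of-vocabulary tweets
    return np.mean(vecs, axis=0)
```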
 
3.
Network architecture
We build a deep learning model based on a recurrent neural network (RNN) with BiLSTM layers. A BiLSTM is a sequence model with two LSTMs, one processing the input in the forward direction and one in the backward direction. A BiLSTM increases the amount of information available to the network and improves the context it can exploit (e.g., knowing both the following and the preceding word in a sentence). As a result, it generally learns faster than the one-directional approach, although this depends on the task. The basic structure of a BiLSTM is shown in Fig. 2.
 
The proposed model contains three BiLSTM layers with 300, 200, and 50 units, each with a ReLU activation function. Each BiLSTM layer is followed by a dropout layer (three dropout layers in total), and a repeater layer follows the first BiLSTM layer. The last layer is a dense layer with 11 outputs corresponding to the 11 emotions; its sigmoid activation gives a probability for each emotion, and these values are thresholded to a 0/1 class for each of the 11 outputs. Fig. 3 shows the network structure; a Keras sketch follows.
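A minimal Keras (tensorflow.keras) sketch of this architecture. The exact wiring, in particular the placement of the repeater (RepeatVector) layer and the return_sequences settings, is one plausible arrangement consistent with the description above and Fig. 3, not a verbatim listing of the implementation:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Bidirectional, LSTM, Dense,
                                     Dropout, RepeatVector, Reshape)

model = Sequential([
    # Treat the 300-d averaged tweet embedding as a length-1 sequence.
    Reshape((1, 300), input_shape=(300,)),
    Bidirectional(LSTM(300, activation="relu")),
    Dropout(0.5),
    RepeatVector(1),  # the "repeater layer" after the first BiLSTM
    Bidirectional(LSTM(200, activation="relu", return_sequences=True)),
    Dropout(0.5),
    Bidirectional(LSTM(50, activation="relu")),
    Dropout(0.5),
    Dense(11, activation="sigmoid"),  # one probability per emotion
])
model.compile(optimizer="adam", loss="mse", metrics=["accuracy"])
```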

Experimental setup and results

The work has been implemented on a Dell G5 15 laptop with a 10th Gen Intel i7 CPU at 2.6 GHz and a 6 GB Nvidia GeForce RTX 2060 GPU. The libraries scikit-learn 0.24.2, gensim 3.8.3, and Keras 2.3.0 under the TensorFlow 2.1.0 platform in Python 3.6.13 have been used. The pre-trained word embedding model "Twitter-CBOW/tweets_cbow_300" is loaded with the gensim library in Python. The dataset is provided publicly by SemEval-2018 Task 1, the E-C subtask for the Arabic language [1]: 2278 tweets for training, 585 for development, and 1518 for testing. We concatenated the three splits, a total of 4381 tweets, for the cross-validation process. The proposed model has been evaluated using the following metrics.
  • Multilabel accuracy (or Jaccard index): “it is the size of the intersection of the predicted and actual label sets divided by the size of their union.” It is computed for each tweet t and averaged over all tweets in the dataset T:
    $$\text{Accuracy} = \frac{1}{\left|T\right|}\sum_{t \in T} \frac{\left|G_t \cap P_t\right|}{\left|G_t \cup P_t\right|}$$
    (1)
    Here, Pt is the set of predicted labels for tweet t, Gt is the set of actual (gold) labels for tweet t, and T is the set of tweets.
  • Precision: the number of true positive results divided by the total number of positive predictions, including those incorrectly identified as positive,
    $$\text{PPV} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
    (2)
  • Recall: It is the number of true positive results divided by the number of samples that should have been identified as positive. Recall is also known as sensitivity in diagnostic binary classification.
    $$\text{Recall (sensitivity, or true positive rate TPR)}: \quad \text{TPR} = \frac{\text{TP}}{\text{P}} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
    (3)
    where TP is true positive, FP is false positive, and FN is false negative.
    The F1 score is the harmonic mean of precision and recall:
    $$\text{F1 score} = 2 \cdot \frac{\text{PPV} \times \text{TPR}}{\text{PPV} + \text{TPR}}$$
    (4)
  • Micro-averaged metrics differ from the overall accuracy when the classification is multilabel; thus, it is essential to clarify the difference between the micro and macro averages of precision, recall, and F1-score. In the macro-average, the metrics are computed independently for each class and then averaged, so all classes are treated equally, whereas the micro-average combines the contributions of all classes to compute the average metric. The micro-average is therefore preferred in a multiclass setup, especially under class imbalance (i.e., many more examples of some classes than of others).
  • Micro-averaged F-score is computed as follows:
    $$\text{Micro-avg Precision (micro-P)} = \frac{\sum_{e \in E} \text{number of tweets correctly assigned to emotion class } e}{\sum_{e \in E} \text{number of tweets assigned to emotion class } e}$$
    (5)
    $$\text{Micro-avg Recall (micro-R)} = \frac{\sum_{e \in E} \text{number of tweets correctly assigned to emotion class } e}{\sum_{e \in E} \text{number of tweets in emotion class } e}$$
    (6)
    $$\text{Micro-avg F} = \frac{2 \times \text{micro-P} \times \text{micro-R}}{\text{micro-P} + \text{micro-R}}$$
    (7)
    where E is the given set of emotions (eleven in our model)
  • Macro-averaged F-score is calculated as follows:
    $$\text{Precision}(P_e) = \frac{\text{number of tweets correctly assigned to emotion class } e}{\text{number of tweets assigned to emotion class } e}$$
    (8)
    $$\text{Recall}(R_e) = \frac{\text{number of tweets correctly assigned to emotion class } e}{\text{number of tweets in emotion class } e}$$
    (9)
    $$F_e = \frac{2 \times P_e \times R_e}{P_e + R_e}$$
    (10)
    $$\text{Macro-avg F} = \frac{1}{\left|E\right|}\sum_{e \in E} F_e$$
    (11)
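These metrics correspond directly to scikit-learn helpers. A minimal sketch, assuming y_true and y_pred are binary indicator arrays of shape (number of tweets, 11):

```python
import numpy as np
from sklearn.metrics import jaccard_score, f1_score

# Toy binary label matrices: rows are tweets, columns are the 11 emotions.
y_true = np.random.randint(0, 2, size=(100, 11))
y_pred = np.random.randint(0, 2, size=(100, 11))

jacc = jaccard_score(y_true, y_pred, average="samples")  # Eq. (1), multilabel accuracy
micro_f1 = f1_score(y_true, y_pred, average="micro")     # Eq. (7)
macro_f1 = f1_score(y_true, y_pred, average="macro")     # Eq. (11)
print(jacc, micro_f1, macro_f1)
```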

Experimental setup

All experiments are conducted using cross-validation with k = 10 splits: each time, one split is used for testing and the remaining nine for training, and the procedure is repeated 3 times, giving a total of 30 trials per experiment (see the sketch below).
After several experiments, the parameters are tuned as follows: the learning rate is set to 0.001, the optimizer is Adam, the loss is MSE ("mse"), and accuracy is tracked as a metric.
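A sketch of this protocol using scikit-learn's repeated k-fold splitter. Here X is assumed to hold the averaged tweet vectors, Y the binary label matrix, and build_model() is a hypothetical factory wrapping the BiLSTM sketched in the methodology section:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import jaccard_score

rkf = RepeatedKFold(n_splits=10, n_repeats=3, random_state=42)  # 10 splits x 3 repeats = 30 trials
scores = []
for train_idx, test_idx in rkf.split(X):
    model = build_model()                       # hypothetical: builds the BiLSTM sketched above
    model.fit(X[train_idx], Y[train_idx], epochs=50, batch_size=200, verbose=0)
    preds = (model.predict(X[test_idx]) >= 0.5).astype(int)  # threshold sigmoid outputs to 0/1
    scores.append(jaccard_score(Y[test_idx], preds, average="samples"))
print(np.mean(scores))                          # average Jaccard accuracy over the 30 trials
```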

Preprocessing variations

First, the effect of using or not using the emoji lexicon described in the methodology section has been investigated, under different stemming variants. We tried two optimizers, Adam and Nadam; both belong to the adaptive learning rate methods and have been shown to give good results in deep learning sentiment analysis systems. The number of epochs is set to 50, the batch size to 200, the three dropout rates to 0.5, 0.5, and 0.5, and the learning rate to 0.001. The activation function of all BiLSTM layers is set to ReLU to mitigate gradient vanishing and overfitting, the loss function is MSE, and the output dense layer activation is a sigmoid. The experiments are conducted with and without the emoji lexicon (Table 1). Table 1 also shows the impact of excluding the frequent word discussed in the methodology section and its derivations from text normalization and stemming (modified stemming); with the emoji lexicon, this improves the Jaccard index by about 0.1% to 0.3%. The best results, shown in bold, are obtained with the emoji lexicon, the modified stem, and the Adam optimizer.
Table 1
Results of the proposed model with and without the emoji lexicon

| Setting | Adam F1 | Adam P | Adam R | Adam Jacc | Nadam F1 | Nadam P | Nadam R | Nadam Jacc |
|---|---|---|---|---|---|---|---|---|
| With emoji lexicon: Stem | 0.613 | 0.696 | 0.549 | 0.494 | 0.614 | 0.691 | 0.553 | 0.496 |
| With emoji lexicon: Modified stem | **0.615** | **0.695** | **0.551** | **0.498** | 0.615 | 0.69 | 0.555 | 0.497 |
| Without emoji lexicon: Stem | 0.591 | 0.669 | 0.529 | 0.471 | 0.593 | 0.665 | 0.533 | 0.476 |
| Without emoji lexicon: Modified stem | 0.591 | 0.67 | 0.528 | 0.471 | 0.593 | 0.664 | 0.536 | 0.475 |

The bold values reflect the best results obtained

Hyperparameter tuning

A manual grid search has been implemented to choose the best hyperparameters. Different batch sizes and epoch counts are investigated and reported in Table 2 (a sketch of the search loop follows the table). The learning rate is set to 0.001, the optimizer is Adam, the loss is MSE, and accuracy is tracked as a metric. This experiment used the modified stemmer and the manually built emoji lexicon. The best results are obtained with a batch size of 200 and 50 epochs, shown in bold in Table 2.
Table 2
Results of the proposed model using different batch sizes and epochs

| Batch size | Mic F1 (20 epochs) | Mic P (20) | Mic R (20) | Jacc (20) | Mic F1 (50 epochs) | Mic P (50) | Mic R (50) | Jacc (50) |
|---|---|---|---|---|---|---|---|---|
| 60 | 0.605 | 0.712 | 0.526 | 0.483 | 0.61 | 0.681 | 0.553 | 0.491 |
| 100 | 0.605 | 0.715 | 0.524 | 0.482 | 0.612 | 0.683 | 0.556 | 0.494 |
| 200 | 0.603 | 0.725 | 0.516 | 0.479 | **0.615** | **0.695** | **0.551** | **0.498** |
| 300 | 0.594 | 0.733 | 0.50 | 0.467 | 0.612 | 0.708 | 0.539 | 0.491 |

The bold values reflect the best results obtained
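A minimal sketch of such a manual grid search, reusing the hypothetical build_model() factory; X_train, Y_train, X_val, and Y_val are assumed to be a training split and a held-out validation split:

```python
from sklearn.metrics import jaccard_score

# Hypothetical manual grid over batch size and number of epochs.
results = {}
for batch_size in (60, 100, 200, 300):
    for epochs in (20, 50):
        model = build_model()                   # hypothetical model factory
        model.fit(X_train, Y_train, batch_size=batch_size, epochs=epochs, verbose=0)
        preds = (model.predict(X_val) >= 0.5).astype(int)
        results[(batch_size, epochs)] = jaccard_score(Y_val, preds, average="samples")

best = max(results, key=results.get)            # (200, 50) in our runs (Table 2)
```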

Dropout effect

Dropout is a regularization method that randomly ignores selected neurons during training. The ignored neurons make no contribution to the activation of downstream neurons, and no weight updates are applied to them on the backward pass. With dropout regularization, the network becomes less sensitive to the specific weights of individual neurons, leading to better generalization and less overfitting to the training data. Table 3 presents variations of our model with and without dropout layers. The best results, shown in bold in Table 3, were obtained with a dropout rate of 0.5: a smaller dropout rate is not sufficient to remove overfitting, whereas a greater rate discards important features, and both cases degrade performance.
Table 3
The dropout layers' effect on performance

| Model | Number of dropout layers | Dropout rates | Jaccard Acc |
|---|---|---|---|
| Model1 | No dropout | - | 0.467 |
| Model2 | 3 | 0.2, 0.2, 0.2 | 0.488 |
| Model3 | 3 | 0.3, 0.3, 0.3 | 0.492 |
| Model4 | 3 | 0.5, 0.5, 0.5 | **0.498** |
| Model5 | 3 | 0.8, 0.8, 0.8 | 0.439 |

The bold values reflect the best results obtained

Other optimizers

Four optimizers (Adam, Nadam, Adamax, and RMSprop) have been tried for the proposed model while fixing the other parameters: batch size 200, learning rate 0.001, 50 epochs, MSE loss, and accuracy metrics. The Adam optimizer achieved the best performance, shown in bold in Table 4.
Table 4
Effect of changing the optimizer with 50 epochs

| Optimizer | Micro-F1 | Micro-P | Micro-R | Jaccard |
|---|---|---|---|---|
| Adam | **0.615** | **0.695** | **0.551** | **0.498** |
| Nadam | 0.615 | 0.69 | 0.555 | 0.497 |
| RMSprop | 0.609 | 0.69 | 0.546 | 0.491 |
| Adamax | 0.609 | 0.726 | 0.524 | 0.485 |

The bold values reflect the best results obtained

Multiple layers

The effect of changing the number of BiLSTM layers has been tested through several experiments, with the results recorded in Table 5. First, we tested the model with one BiLSTM layer and a single dropout layer with a rate of 0.5, then with two BiLSTM layers, each followed by a dropout layer with the rate fixed at 0.5. There is a noticeable performance increase using two layers (Model 2) rather than a single layer (Model 1), and a further enhancement with three layers (Model 3), shown in bold in Table 5. Layers beyond the third, however, increase the noise for our task and dataset and decrease the accuracy.
Table 5
Models tested by varying the number of BiLSTM layers

| Model | Number of BiLSTM layers | BiLSTM layer units | Dropout layers | Jaccard Acc |
|---|---|---|---|---|
| Model1 | 1 | 300 | 1 | 0.477 |
| Model2 | 2 | 300, 200 | 2 | 0.485 |
| Model3 | 3 | 300, 200, 50 | 3 | **0.498** |
| Model4 | 4 | 300, 200, 50, 50 | 4 | 0.481 |

The bold values reflect the best results obtained
Table 6 below summarizes the best model parameters obtained from all previous experiments.
Table 6
Proposed model best parameters

| Parameter | Value |
|---|---|
| Dropout rates | 0.5, 0.5, 0.5 |
| Optimizer | Adam |
| Loss | MSE ("mse") |
| Number of BiLSTM layers | 3 |
| Activation function in BiLSTMs | ReLU |
| Activation function in output layer | Sigmoid |
| Epochs | 50 |
| Batch size | 200 |
| Learning rate | 0.001 |

Performance comparison with other machine learning models

A comparison between the proposed model and other models that studied the same task on the same dataset is given in Table 7. The system achieves about a 9% improvement in validation accuracy over the previous best model for this task, a support vector classifier (SVC), and it outperforms all the models in terms of test accuracy (Jaccard index) and validation accuracy: it exceeds the EMA model by about 0.9%, the other SVM model (TW-Star) by 3.3%, and the deep learning model UNCC by 5.2%.
Table 7
Proposed model compared with other state-of-the-art models

| Model | Preprocessing | Features | Classification algorithm | Validation accuracy | Test accuracy | Micro F1 | Macro F1 |
|---|---|---|---|---|---|---|---|
| Proposed model | Normalization + manual emoji lexicon + ARLSTEM | AraVec [18] | Bidirectional LSTM | 0.575 | 0.498 | 0.615 | 0.440 |
| EMA | Normalization + manual emoji lexicon + ARLSTEM | AraVec | SVC | 0.488 | 0.489 | 0.618 | 0.461 |
| TW-Star | Emo + Stem + Stop | TF-IDF | SVM | NA | 0.465 | 0.597 | 0.446 |
| UNCC | Tokenization, white space removal | AraVec + Affective Tweets features | Fully connected neural network | NA | 0.446 | 0.572 | 0.447 |
| SVM-Unigrams | NA | Unigrams | SVM | NA | 0.38 | 0.516 | 0.384 |
| Amrita | NA | Doc2Vec | RF | NA | 0.254 | 0.379 | 0.25 |

Discussion

Recurrent neural networks (RNNs) are deep learning networks that take sequence order into account and are therefore well suited to textual data classification. However, RNNs suffer from the vanishing gradient problem when classifying long data sequences, which long short-term memory (LSTM) networks were designed to solve. While an LSTM extracts context in the forward direction only, the bidirectional variant (BiLSTM) extracts contextual information by modeling dependencies in both the forward and backward directions; the output combines the corresponding states of the forward and backward LSTMs. Our proposed approach is the first BiLSTM-based deep learning model for multilabel emotion classification in Arabic social media. Multiple depths were tried: increasing the number of BiLSTM layers beyond the three proposed did not enhance performance and only increased the computational complexity, due to the additional parameters and weights computed in the extra layers. The computational complexity of a standard LSTM network per time step with stochastic gradient descent optimization is O(W), where W is the total number of parameters [30]; in our model the BiLSTM complexity is O(2W), since the input is passed twice, through the forward and backward LSTM cells. Although the BiLSTM has higher computational complexity, it is effective at reducing the size and dimensionality of the feature vector and exploits global features from the text. In addition, dropout is applied in our model, reducing the number of effective parameters and preventing overfitting.
Also, after experimenting with different epoch counts for this model, the optimal number of epochs is set to 50; more epochs only increase the time complexity without any noticeable performance gain. The optimal batch size is 200; increasing the batch size accelerates the run time, but batches beyond 200 caused performance degradation. The model performs better than all other models implemented for the same task on the same dataset.

Conclusions and future work

This study proposed a deep learning model for multilabel emotion classification in Arabic tweets using the SemEval-2018 Task 1 dataset. Several preprocessing steps were adopted, such as normalization, stemming, and replacing the most common emojis with their corresponding meanings using a manually created emoji lexicon; word embedding proved to be the best feature representation technique. The AraVec pre-trained CBOW word embedding model builds 300-dimensional vectors for each word in our dataset; the average embedded word vector is then calculated for each tweet, and a BiLSTM is used for classification. The proposed method achieved the best performance compared with SVM, RF, and fully connected deep NN models, with a 9% increase in validation accuracy over the previous best obtained by SVM. The BiLSTM increases the amount of information available through its two-way network and improves the context the network can use. The effect of hyperparameter tuning was investigated through separate experiments, since automatic grid search is not supported by the Keras libraries for this LSTM model. Future improvements in preprocessing, such as removing the ambiguity that results from stemming certain noun and plural endings [Arabic examples appear as inline images in the original] and applying more restrictive grammatical rules, are expected to enhance model performance. The effect of using different deep learning models, such as CNNs, will also be investigated.

Acknowledgements

Not applicable.

Declarations

Not applicable.
Not applicable.

Competing interests

The authors declare that they have no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Footnotes
2
The Python commands to load the NLTK Arabic stop word list are "from nltk.corpus import stopwords" and "arabic_stopwords = stopwords.words('arabic')".
 
Literature
1. Mohammad S, et al. SemEval-2018 Task 1: affect in tweets. In: Proceedings of the 12th International Workshop on Semantic Evaluation. 2018.
2.
3. Plutchik R. A general psychoevolutionary theory of emotion. In: Theories of emotion. Amsterdam: Elsevier; 1980. p. 3-33.
4. Plutchik R. The psychology and biology of emotion. New York: HarperCollins College Publishers; 1994.
5. Trad C, et al. Facial action unit and emotion recognition with head pose variations. In: International Conference on Advanced Data Mining and Applications. Springer; 2012.
6. Ruiz-Garcia A. A hybrid deep learning neural approach for emotion recognition from facial expressions for socially assistive robots. Neural Comput Appl. 2018;29(7):359-73.
7. Wegrzyn M. Mapping the emotional face. How individual face parts contribute to successful emotion recognition. PLoS ONE. 2017;12(5):e0177239.
8. Filippini C. Facilitating the child-robot interaction by endowing the robot with the capability of understanding the child engagement: the case of Mio Amico robot. Int J Soc Robot. 2020;13:1-13.
9. Ozcan T, Basturk A. Transfer learning-based convolutional neural networks with heuristic optimization for hand gesture recognition. Neural Comput Appl. 2019;31(12):8955-70.
10. Constantine L, et al. A framework for emotion recognition from human computer interaction in natural setting. In: 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2016), Workshop on Issues of Sentiment Discovery and Opinion Mining (WISDOM 2016). 2016.
11. Hibbeln MT. How is your user feeling? Inferring emotion through human-computer interaction devices. MIS Q. 2017;41(1):1-21.
12.
13. Karyotis C. A fuzzy computational model of emotion for cloud based sentiment analysis. Inf Sci. 2018;433:448-63.
14. Giatsoglou M. Sentiment analysis leveraging emotions and word embeddings. Expert Syst Appl. 2017;69:214-24.
15. Abdul-Mageed M, Ungar L. EmoNet: fine-grained emotion detection with gated recurrent neural networks. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017.
16. Badaro G, et al. EMA at SemEval-2018 Task 1: emotion mining for Arabic. In: Proceedings of the 12th International Workshop on Semantic Evaluation. 2018.
17. Badaro G. ArSEL: a large scale Arabic sentiment and emotion lexicon. OSACT. 2018;3:26.
18. Soliman AB, Eissa K, El-Beltagy SR. AraVec: a set of Arabic word embedding models for use in Arabic NLP. Procedia Comput Sci. 2017;117:256-65.
19. Sonawane B, Sharma P. Review of automated emotion-based quantification of facial expression in Parkinson's patients. Vis Comput. 2020;37(1):17.
20. Shoukry A, Rafea A. Preprocessing Egyptian dialect tweets for sentiment mining. 2012.
21. Abainia K, Ouamour S, Sayoud H. A novel robust Arabic light stemmer. J Exp Theor Artif Intell. 2017;29(3):557-73.
22. Mulki H, et al. Tw-StAR at SemEval-2018 Task 1: preprocessing impact on multi-label emotion classification. In: Proceedings of the 12th International Workshop on Semantic Evaluation. 2018.
23. Abdullah M, Shaikh S. TeamUNCC at SemEval-2018 Task 1: emotion detection in English and Arabic tweets using deep learning. In: Proceedings of the 12th International Workshop on Semantic Evaluation. 2018.
24. Unnithan NA, et al. Amrita_student at SemEval-2018 Task 1: distributed representation of social media text for affects in tweets. In: Proceedings of the 12th International Workshop on Semantic Evaluation. 2018.
25. George A, Barathi Ganesh HB, Soman KP. TeamCEN at SemEval-2018 Task 1: global vectors representation in emotion detection. In: Proceedings of the 12th International Workshop on Semantic Evaluation. 2018.
26. Pennington J, Socher R, Manning CD. GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.
27. Rostami M. Review of swarm intelligence-based feature selection methods. Eng Appl Artif Intell. 2021;100:104210.
28. Rostami M, Berahmand K, Forouzandeh S. A novel community detection based genetic algorithm for feature selection. J Big Data. 2021;8(1):1-27.
29. Rostami M, Berahmand K, Forouzandeh S. A novel method of constrained feature selection by the measurement of pairwise constraints uncertainty. J Big Data. 2020;7(1):1-21.
30. Sak H, Senior A, Beaufays F. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. 2014. arXiv:1402.1128.
Metadata
Title
Deep learning for emotion analysis in Arabic tweets
Authors
Enas A. Hakim Khalil
Enas M. F. El Houby
Hoda Korashy Mohamed
Publication date
01-12-2021
Publisher
Springer International Publishing
Published in
Journal of Big Data / Issue 1/2021
Electronic ISSN: 2196-1115
DOI
https://doi.org/10.1186/s40537-021-00523-w
