Introduction
Due to the increased amount of data from user-generated content on social media, text classification has become an important area of research in the last 10 years. This has led researchers to apply text classification methods for analyzing sentiments and topics [1‐3], predicting gender [4‐6], and detecting false news [7, 8]. Studies on social media have indicated that, as a wide variety of people use this medium to share health information [9, 10], the information provided is not always accurate [11, 12], which is a serious concern. However, a precursor for studying the trustworthiness of health-related tweets is the development of a model to detect health-related information posts on social media.
Additional important reasons for devising a high-quality method for identifying health information posted on social media include building and/or studying health communication theories, evaluating health communication, and understanding public concerns on social media during an outbreak [13‐15]. Paul et al. [16] and Tuarob et al. [17] built machine-learning models to detect (English) health-related information on social media platforms.
Unfortunately, these models are highly language-dependent: as they were not created for Arabic, they cannot be directly applied to that language, an important consideration given the prevalence of social media usage in Arabic countries [11]. For example, text normalization is an important step in text classification. In English, this might include normalizing capital letters to lowercase, yet Arabic has no lowercase or capital letters; instead, normalizing Arabic text involves, for example, mapping the different forms of alef (ا, إ, أ) to (ا) or removing diacritics, which do not exist in English. Thus, Maw et al. [18] pointed out that even if some algorithms perform well for a particular language, they might yield worse results when applied to another.
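To make these Arabic-specific normalization steps concrete, the following minimal Python sketch unifies the alef forms and strips diacritics and kashida. It is not taken from any of the cited studies; the character ranges are standard Unicode code points for the Arabic block.

```python
import re

# Arabic diacritics (tashkeel) occupy U+064B..U+0652; the alef variants
# are U+0622 (آ), U+0623 (أ), and U+0625 (إ); kashida (tatweel) is U+0640.
DIACRITICS = re.compile("[\u064B-\u0652]")
ALEF_FORMS = re.compile("[\u0622\u0623\u0625]")

def normalize_arabic(text: str) -> str:
    """Unify alef forms to bare alef, then drop diacritics and kashida."""
    text = ALEF_FORMS.sub("\u0627", text)  # bare alef: ا
    text = DIACRITICS.sub("", text)
    return text.replace("\u0640", "")      # remove kashida
```

After this step, spellings that differ only in alef form or vocalization map to the same token, which is the point of the normalization discussed above.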
There have been many studies of text classification regarding Arabic natural language processing on social media. Most of them focus on sentiment analysis, and a number of literature surveys and systematic literature reviews have been conducted on this Arabic-language-classification task [1‐3]. More specifically, Al-Rubaiee et al. [19], Alayba et al. [20], and Alabbas et al. [21] conducted targeted sentiment-analysis studies. Al-Rubaiee et al. [19] used sentiment analysis to evaluate a bank application: they collected tweets about the bank's service, labelled them as either positive or negative, pre-processed the tweets using various techniques, and compared the performance of the Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers. The best results were obtained with SVM, with an accuracy of 89.68%.
Similarly, Alayba et al. [20] collected tweets about health services in Saudi Arabia and labelled them as positive or negative. Their best results were achieved using stochastic gradient descent, with an accuracy of 91.87%. Moreover, Alabbas et al. [21] trained a classifier to detect natural disasters by labelling tweets, some of which contained information about a flood whereas others did not. They trained and compared different classifiers, namely SVM, K-Nearest Neighbors (KNN), and NB. The best model was SVM, with an accuracy of 90.7%. The Alayba et al. and Alabbas et al. studies are expanded on in the next section.
Other Arabic-text classification work used social media data to detect hate speech [22‐24] and analyze crisis responses, such as in the event of a flood [25]. However, there is a lack of studies on detecting Arabic-language health-related tweets. In this paper, we aim to derive models that accurately detect Arabic-language health data on Twitter and to test these models on multiple data sets to evaluate their generality.
Statistics show that Twitter is very popular with Arabic speakers and that it is widely used for sharing health-related information [9, 10]. As such, one goal of this paper is to enrich the literature by providing technical details for the development of a model to detect Arabic health-related tweets. Such a model can help researchers from many disciplines study health-related tweets more comprehensively and will provide the foundation for empirical studies that are not restricted to tweets with a specific origin (where the origin, for example, a known health-tweet author or organization, serves as the means of determining their health-information focus). For example, while Alnemer et al. [12] extracted tweets from specific health-related Twitter accounts in order to study health-related information on social media, Albalawi et al. [11] pointed out that other users also (more informally) tweet about health and that those tweets should not be ignored in an analysis of health tweets. A model that can automatically extract health-related tweets can further the holistic study of health-related tweets without requiring that specific health-related accounts be followed. Furthermore, providing the technical details for the development of such a model will enrich the literature, not only for this specific text-classification task (i.e., extracting health-information tweets), but also for other Arabic-text classification tasks.
This paper is structured as follows. First, we discuss related work in Sect. "Related work". In Sect. "Methods", we describe the general methods used in this study, focusing especially on the data sets and evaluation metrics employed. Section "First experiment" reports on the study that assesses the impact of various pre-processing techniques on traditional machine-learning algorithms when classifying health-related tweets. Subsequently, Sect. "Second experiment" describes a second study, which looks at the impact of different word embeddings on deep-learning algorithms for the same purpose. Finally, Sects. "Discussion" and "Conclusion" discuss and compare the results, drawing conclusions from this work.
Related work
There is a vast body of literature on Arabic text classification for social media. Alayba et al. [20] analyzed tweets to detect sentiment about health services in Saudi Arabia. They collected tweets using trending hashtags related to health services and divided their data set into two categories: negative and positive. When processing the tweets, they removed diacritics and kashida and normalized three additional letter groups: اأإ to ا, ة to ه, and ئ to ى. They used unigram and bigram text-extraction techniques with Term Frequency–Inverse Document Frequency (hereafter TF-IDF) for feature selection. They then compared the performance of seven algorithms and experimented with a Convolutional Neural Network (CNN). The best results were achieved with stochastic gradient descent and SVM, with an accuracy of 91.87%. They did not use any stemming methods during pre-processing.
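As an illustration of the TF-IDF weighting referred to above, here is a generic sketch of one common variant; it is not the exact formulation used by Alayba et al.:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    Uses term frequency normalized by document length and
    idf = log(N / df) -- one common variant among several.
    Returns one {term: weight} dict per document.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: one count per document
    return [{term: (count / len(doc)) * math.log(n / df[term])
             for term, count in Counter(doc).items()}
            for doc in docs]
```

A term that appears in every document receives weight zero, so TF-IDF discounts uninformative terms while emphasizing terms that distinguish documents.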
Alabbas et al. [21] developed a model to detect a natural disaster, specifically a high-risk flood, in tweets. To achieve this, they trained a classifier on labelled tweets, some containing information about a flood and others not. They removed diacritics from the text based on the assumption that most text is written without them. In a manner similar to that of Alayba et al., they used TF-IDF for feature selection. Alabbas et al. investigated the performance of different classifiers, specifically the NNET, SVM, KNN, Decision Tree (C4.5–J48), and NB algorithms. Unlike Alayba et al., they also compared different stemming techniques for Arabic: no stemming, light stemming, and prefix/suffix removal. They also normalized one letter, mapping اأإ to ا. The authors concluded that SVM performs better than the other algorithms, and that most of the algorithms included in the study perform better without stemming.
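The light-stemming setting compared by Alabbas et al. can be illustrated with a toy light stemmer that strips one common prefix and one common suffix. The affix lists below are illustrative assumptions, not the lists used in the cited work; real light stemmers use longer, carefully ordered affix tables.

```python
# Illustrative Arabic light stemmer. Longest prefixes are tried first;
# a minimum remaining length of 2 guards against over-stripping.
PREFIXES = ("\u0648\u0627\u0644", "\u0628\u0627\u0644", "\u0627\u0644")  # وال بال ال
SUFFIXES = ("\u0627\u062A", "\u0648\u0646", "\u064A\u0646", "\u0629")    # ات ون ين ة

def light_stem(word: str) -> str:
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 2:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 2:
            word = word[:-len(s)]
            break
    return word
```

Prefix/suffix removal as compared in the study is essentially this operation; root stemming goes further and reduces each word to its triliteral root.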
Boudad et al. [26] compared the performance of KNN, SVM, and NB in sentiment analysis of Arabic tweets. Moreover, they compared the impact of different types of stemming, specifically light stemming and root stemming, and they compared TF-IDF to Binary Term Occurrence (BTO) for feature selection. They found that the best accuracy was achieved with light stemming, the SVM classifier, and TF-IDF for feature selection. During normalization, they normalized ه and ى, in addition to removing hashtags. It is not obvious whether their findings contradict those of Alabbas et al., however, as Boudad et al. did not train a model without stemming.
Duwairi et al. [27] and Oussous et al. [28] studied the impact of root stemming and light stemming, in addition to stop-word removal, on sentiment analysis. While Oussous et al. found that light stemming improves accuracy, Duwairi et al. stated that stemming and stop-word removal did not improve the accuracy of their model. Furthermore, these studies did not investigate the impact of the other pre-processing techniques discussed above: although Oussous et al. removed tashkeel, duplicate letters, and kashida, they did not report the impact of these steps on their model's results.
Abdulla et al. [29] built a model to detect the sentiment of tweets. They found that light stemming decreases model accuracy, which supports the findings of Alabbas et al. In comparison to Boudad et al., however, they only normalized two letters, ه and ا. Like the studies mentioned above, they did not investigate the impact of normalizing letters on the accuracy of their model.
Alakrot et al. [24] developed a model to detect hate speech in YouTube comments, which they trained on 15,000 comments labelled as either positive or negative. They normalized the same letters as Alabbas et al. [21], along with two additional letters, because of their similar morphological sounds. Their best model achieved an F1 score of 82%, and they reported the usefulness of stemming and normalization, which contradicts Alabbas et al. [21] and Abdulla et al. [29].
As the studies described above suggest, there is no agreement on pre-processing steps for the Arabic language, as researchers used different techniques when normalizing the text. Alabbas et al. [21] only normalized one letter, mapping أاإ to ا; Boudad et al. [26] and Abdulla et al. [29] only normalized ة to ه; while Alayba et al. [20] and Alakrot et al. [24] normalized other letters. Furthermore, both Boudad et al. [26] and Alakrot et al. [24] reported the usefulness of stemming, while Alabbas et al. [21] and Abdulla et al. [29] found that stemming decreased the accuracy of their models. These conflicting results raise the question of which methods are best for normalizing Arabic data sets, particularly for specific classification tasks.
In addition to traditional machine-learning algorithms, there has been a dramatic increase in the number of studies applying deep-learning methods to Arabic text classification in the last few years. Some of these studies compared deep-learning models, such as CNNs and Long Short-Term Memory (LSTM) networks, to traditional machine-learning models. For example, Oussous et al. [30] compared four models (NB, SVM, CNN, and LSTM) for detecting the sentiment of tweets. They also investigated the impact of pre-processing techniques, specifically normalization, stop-word removal, and stemming. They used traditional BTO for feature extraction with NB and SVM, and Word2Vec for the word-embedding layer of the CNN and LSTM models. They concluded that normalization with light stemming improves the accuracy of their model and that the CNN and LSTM classifiers perform better than the SVM and NB ones. They only considered normalizing three letters: ي, ة, and ا.
It is worth noting that word embedding is a natural language processing technique that represents words as vectors [31], the dimensions of which are usually set prior to training the embedding. A higher-dimensional vector offers a better opportunity to represent word semantics [22]. This technique encodes words geometrically based on how frequently they appear together [8]; thus, words with similar meanings are represented by similar vectors. Yet, to be effective, word embeddings need to be trained on large data sets [32]. Thus, researchers often use existing pre-trained word embeddings, as demonstrated by Mohaouchane et al. [33].
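The notion that similar words receive similar vectors is usually measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = (math.sqrt(sum(a * a for a in u))
             * math.sqrt(sum(b * b for b in v)))
    return dot / norms

# Toy embedding table (invented values): related words point the same way.
embeddings = {
    "doctor": [0.9, 0.8, 0.1],
    "nurse":  [0.85, 0.75, 0.2],
    "car":    [0.1, 0.05, 0.9],
}
```

With such a table, "doctor" scores far higher against "nurse" than against "car", which is how downstream classifiers benefit from semantic structure in the embedding space.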
Mohaouchane et al. [33] used the same data set as Alakrot et al. [24] and followed similar pre-processing steps. They used AraVec pre-trained word embeddings [34] as the input layer for a CNN, and they improved the accuracy of detecting hate speech in this data set from an F1 score of 82% to 84.05%.
In contrast to the studies by Oussous et al. [30] and Mohaouchane et al. [33], Abdullah et al. [35] developed a CNN-LSTM model to detect the emotion of tweets. Unlike Oussous et al., Abdullah et al. [35] used AraVec pre-trained word embeddings for their input layer. They claimed that the normalization and stemming steps did not improve the performance of their model.
Similar to Abdullah et al. [35], Heikal et al. [36] developed a model that uses AraVec pre-trained word embeddings in the input layer. They also applied different pre-processing techniques, removing diacritics, repeated characters, and punctuation. They assembled a model consisting of a combined CNN and LSTM architecture and achieved an F1 score of 64%, which they claimed outperforms a state-of-the-art algorithm.
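Mechanically, using a pre-trained embedding as a network's input layer amounts to building a vocabulary-indexed matrix of vectors that the network looks up by token index. The framework-agnostic sketch below shows this step; the vector table is a stand-in for a real model such as AraVec, and the random out-of-vocabulary fallback is one common choice among several.

```python
import random

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Build a len(vocab) x dim matrix whose row i holds the pretrained
    vector of the word with index i (vocab maps word -> row index,
    assumed contiguous 0..n-1); out-of-vocabulary words receive small
    random vectors.
    """
    rng = random.Random(seed)
    matrix = []
    for word, _ in sorted(vocab.items(), key=lambda kv: kv[1]):
        vec = pretrained.get(word)
        matrix.append(list(vec) if vec is not None
                      else [rng.uniform(-0.05, 0.05) for _ in range(dim)])
    return matrix
```

In a CNN or LSTM framework this matrix would initialize the embedding layer's weights, which can then be frozen or fine-tuned during training.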
The reason the above-mentioned studies [33, 35, 36] utilized customized pre-processing techniques when using pre-trained word embeddings is unclear. According to Li et al. [37], the ideal way to obtain the most improvement from pre-trained word embeddings is to follow the same pre-processing steps that were applied to the corpus when the embedding vectors were created, unless those steps are not well documented. The pre-processing steps used to normalize the data sets for the AraVec pre-trained word embeddings are documented by Soliman et al. [34].
Abuzayed and Elsayed [38] investigated the performance of classical and deep-learning models for detecting hate speech in Arabic tweets. Their results showed that the classical TF-IDF word representation performs better than word embeddings with classical algorithms, but that a combined CNN-LSTM deep-learning architecture performs better than the classical algorithms. This observation might help to answer the question posed by Guellil et al. [39]: "Are deep-learning approaches really more efficient than traditional approaches, such as SVM, NB, etc., for Arabic natural processing?" (p. 9). This is a core research agenda for this work, but in the context of classifying/identifying health tweets in particular.
While Mohaouchane et al. [33], Abdullah et al. [35], and Heikal et al. [36] used AraVec pre-trained word embeddings, additional pre-trained Arabic word-embedding models have been investigated. Alwehaibi and Roy [40] asserted that pre-trained models require millions of words to be effectively trained; consequently, they investigated the usefulness of the AraVec, fastText, and 'Altowayan and Tao' [41] pre-trained word embeddings for text classification. To compare these approaches, they developed a CNN-LSTM deep neural network model to predict the sentiment of tweets, and they found that the Altowayan and Tao [41] embeddings outperform AraVec and fastText: the authors' best model achieved 93.5% accuracy when classifying texts into positive, negative, and neutral sentiment.
Utilizing a collection of 55 million tweets, Fouad et al. [42] developed their own pre-trained word-embedding model (ArWordVec) using three popular techniques: Word2Vec Skip-Gram, Word2Vec Continuous Bag-of-Words (CBOW), and Global Vectors (GloVe). Using a CNN architecture, they compared the performance of their pre-trained embeddings to that of the AraVec pre-trained embeddings and found that their model outperformed AraVec.
Based on the literature identified above, Table 1 presents the pre-trained word-embedding models that have been applied to the classification of Arabic texts.

Table 1
Pre-trained word embedding models

| Model | Training data | Source | Training technique | Availability | Pre-processing |
| --- | --- | --- | --- | --- | --- |
| fastText | "400 million tokens" from Wikipedia + "24 terabytes of raw text data" from Common Crawl | Common Crawl and Wikipedia | CBOW with sub-word techniques | Open | Only tokenization |
| AraVec | 66.9 million tweets and 320,636 documents from Wikipedia | Twitter and Wikipedia | CBOW and Skip-Gram with different n-gram and unigram features | Open | Remove non-Arabic letters; replace ة with ه; normalize alef; remove duplicates; normalize mentions, URLs, and emojis |
| Mazajak | 250 million tweets | Twitter | CBOW and Skip-Gram with different n-grams | Open | Remove URLs, tashkeel, emojis, and punctuation |
| ArWordVec | 55 million tweets | Twitter | CBOW and Skip-Gram | Open | Normalize mentions and URLs; remove tashkeel and punctuation; normalize bare alef; replace ى with ي, ؤ with ء, ئ with ء, and ة with ه |
It is worth noting that the majority of the studies reviewed above that used Arabic social media for text classification used SVM, followed by NB. There is also a recent trend of using deep-learning methods for Arabic text classification, where CNN and LSTM architectures are primarily used. This observation is consistent with the findings of Oueslati et al. [43], who conducted a review of the techniques used for sentiment analysis of Arabic-language tweets.
While several recent studies reported the effectiveness of using pre-trained word embeddings as the embedding layer of deep-learning models, there have been only a few comparative studies of word-embedding techniques in the context of Arabic text mining. For example, four different studies [33, 35, 36, 38] used AraVec only, one study used fastText [41], and no studies were found that used ArWordVec.
As for traditional methods, the majority of Arabic works have emphasized some pre-processing techniques, such as stemming, but none of the studies discussed determined the impact of normalizing Arabic letters or of removing diacritics. Some claimed these techniques negatively affect classifier performance [44, 45] but did not elaborate on or provide evidence for their assertions. Furthermore, there have been no studies to date on the detection of Arabic health-related tweets on Twitter.
Thus, this paper aimed to investigate the impact of different pre-processing techniques on model accuracy. An additional aim was to employ deep-learning methods to compare the performance of pre-trained word-embedding techniques. This was carried out through a text-classification task focused on detecting Arabic-language health-related tweets. Building on these experiments, the study then compared the best classifiers developed using deep-learning methods to the best classifiers developed using traditional machine-learning (ML) methods, to identify the overall best-of-breed classification approach for health-tweet identification.
Discussion
For the first experiment, which was concerned with pre-processing techniques, the best algorithm performance was achieved with 4 pre-processing techniques out of a possible 26. Some of the popular techniques presented in Table 5 and used by other researchers, such as normalizing alef, applying different types of stemming, and removing numbers, did not improve the accuracy of our final model in the first experiment.
In the literature, there was a focus on studying the impact of stemming on algorithm performance [24, 26, 28, 84]. Most studies found stemming increased the accuracy of the baseline model [24, 26, 84], and this study is in agreement with these previous studies. Having said that, the best combination of pre-processing techniques for our final model outperformed any combination that included any type of stemming, as shown in Appendix 2. It is also important to note that, of the four pre-processing techniques used in the final model, only one, the removal of repeated characters, can be considered not Arabic-specific. Normalizing the letters ي and ه is Arabic-specific; likewise, the fourth technique removed kashida, which is widely used by Arabic writers. This might suggest that, in text classification for Arabic, Arabic-specific normalization techniques play a bigger role in improving model performance than other, general pre-processing techniques do. This possibility also highlights the importance of this study and the need for more studies that systematically assess the impact of Arabic-specific normalization techniques on model performance across more data sets.
Nevertheless, we found that some rarely used pre-processing techniques performed well in improving the classifier models. For example, lemmatization was used in only one study [46] in the literature reviewed in this paper. Yet, as can be seen in Table 5, lemmatization performed well with all four classifier models. Notwithstanding, it was not one of the four techniques that improved the accuracy of the final best MNB model in the first experiment. It is also worth noting that whereas the MNB classifier achieved an F1 score of 87.7 on the first data set, its performance decreased on the second data set.
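Since the discussion moves between accuracy and F1, it is worth recalling that F1 is the harmonic mean of precision and recall, so it penalizes a classifier whose accuracy is propped up by the majority class. A quick reference implementation from raw counts:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

This is why a model's accuracy and F1 can diverge on an imbalanced data set, as seen in the comparisons below.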
In the second experiment, we made two observations. First, there was little difference in performance between Mazajak Skip-Gram and Mazajak CBOW on the first data set using BLSTM; the same applied to Mazajak Skip-Gram and CBOW with BLSTM on the second data set, and to Mazajak and ArWordVec using the CNN architecture. In contrast, there was a notable difference between AraVec CBOW and AraVec Skip-Gram: AraVec Skip-Gram performed better in both architectures. Second, AraVec's performance decreased only slightly between the two architectures, whereas Mazajak, ArWordVec, and fastText showed a more notable decrease. As a result, AraVec performed better than the other pre-trained word-embedding models with the CNN architecture on the first data set. Furthermore, on the second data set with the CNN architecture, AraVec Skip-Gram showed a negligible increase in performance compared to the BLSTM architecture.
When comparing deep-learning methods to traditional algorithms, the results for the first data set indicated that the BLSTM architecture performed better than the MNB classifier with all pre-trained embeddings except AraVec CBOW, where the MNB classifier performed better. When models using the CNN architecture were compared to the MNB classifier, the MNB classifier performed better than most CNN classifiers, except the CNN classifier that used AraVec Skip-Gram as its input layer, as reported in Tables 6 and 8. That CNN classifier performed identically to the MNB classifier in terms of accuracy, at 92.7%, and only marginally differently in F1 score, at 88.01% compared to 87.9%, with AraVec Skip-Gram performing better than the MNB classifier.
On the second data set, however, the CNN and BLSTM models both performed better with all the pre-trained word-embedding models than did the MNB classifier. The results suggest that the MNB classifier is comparable to some deep-learning methods on the first data set, but that all the deep-learning methods outperformed the MNB classifier on the second data set, which represents more generalized, unseen data. This might contribute to answering the question Guellil et al. posed in the literature [39]: "Are deep-learning approaches really more efficient than traditional approaches?". The answer, as determined in this experiment, seems to be "yes" with regard to generality.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.