Introduction
-
Develop the largest labeled Afaan Oromo hate speech classification dataset of its kind.
-
This work investigates how accurately five state-of-the-art deep learning models detect hate speech in a resource-scarce language, Afaan Oromo. The experimental results provide insight into the models' detection accuracy and into the effect of pre-trained embeddings and text data augmentation, and they offer guidelines for deploying such models in real-world applications.
-
Assess the impact of adding augmented textual data on Afaan Oromo hate speech classification performance.
-
Compare the impact of using a pre-trained Word2Vec model against word embeddings trained directly with the hate speech classification model.
-
Build a pre-trained word embedding model that can be reused in other work in this area.
Background and related work
Word embedding
Related work
Afaan Oromo Corpus creation and annotation
Corpus collection
Page/account names | Page/account names |
---|---|
FBC Afaan Oromoo TV | Ethiopian Press Agency/Bariisaa |
BBC Afaan Oromo TV | Oromia Democratic Party/ODP |
OMN TV | Kush Media |
Fanabc Afaan Oromo | OBS |
Jawar Mohammed | Taye Dendea Aredo |
Class | Class label | No of texts |
---|---|---|
Neutral | 0 | 10,525 |
Hate | 1 | 10,525 |
Offensive | 2 | 10,525 |
Both | 3 | 10,525 |
Annotation guideline
-
If a post/comment refers to the alleged inferiority or superiority of a target group.
-
If a post/comment targets different characteristics of a person and motivates the audience to take action or commit violations.
-
If a post/comment contains a stereotype, that is, an over-generalized belief about a given target.
-
If a post/comment accuses or condemns people based on their target group.
-
If a post or comment contains violent or insulting words but it is not possible to explicitly identify a target group in the post/comment.
-
If a post or comment contains defamation, which is a false accusation against a person or an attack on a person's character.
-
If a post or comment contains insulting, dirty, disgusting, or upsetting words but does not motivate people to take action.
-
If a post or comment combines hateful expression with insults, threats, or derogatory terms directed at a target group.
Data preprocessor
Feature engineering
Feature learning and classification using deep learning models
-
CNN is a class of deep learning model that uses convolutional layers and max pooling or max-over-time pooling layers to extract higher-level features.
-
LSTM is a powerful kind of RNN used for processing sequential data such as sound, time series (sensor) data or written natural language.
-
BiLSTM is a hybrid bidirectional LSTM and CNN architecture.
-
GRU is similar to long short-term memory (LSTM) with a forget gate, but has fewer parameters than LSTM, as it lacks an output gate.
-
CONV-LSTM is a type of recurrent neural network for spatio-temporal prediction that has convolutional structures in both the input-to-state and state-to-state transitions.
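The convolution and max-over-time pooling step mentioned for the CNN can be illustrated with a minimal pure-Python sketch. It assumes a one-dimensional convolution over a sequence of token embeddings followed by a ReLU; the function names, the toy embeddings, and the filter weights are hypothetical and are not taken from the paper's implementation.

```python
from typing import List

Vector = List[float]


def conv1d_valid(sequence: List[Vector], kernel: List[Vector]) -> List[float]:
    """Slide one filter over a sequence of token embeddings (no padding).

    `kernel` spans len(kernel) consecutive tokens; each output value is the
    ReLU of the dot product between the filter and one window of tokens.
    """
    width = len(kernel)
    feature_map = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = sum(
            w * x
            for kernel_row, token in zip(kernel, window)
            for w, x in zip(kernel_row, token)
        )
        feature_map.append(max(0.0, total))  # ReLU activation
    return feature_map


def max_over_time(feature_map: List[float]) -> float:
    """Keep only the strongest response of the filter across all positions."""
    return max(feature_map)


# Toy input: four tokens with 2-dimensional embeddings (illustrative values).
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# One filter covering two consecutive tokens (illustrative weights).
kernel = [[1.0, -1.0], [0.5, 0.5]]

feature_map = conv1d_valid(embedded, kernel)
pooled = max_over_time(feature_map)
```

In this sketch each filter contributes a single pooled value, so a layer with many filters would yield one fixed-length feature vector per text regardless of its length.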
Hyperparameter name | Hyperparameter value |
---|---|
Number of Convolution Layers | 3 |
Number of Filters in Convolution Layer | 250 |
Filter Size | \(3 \times 3\) |
Dropout Rate | 0.5 |
Batch Size | 128 |
Embedding Dimension | 300 |
Hidden Layer Activation Function | ReLU |
Output Layer Activation Function | Softmax |
Optimizer | AdaGrad |
Learning Rate | 0.001 |
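For reproducibility, the hyperparameters in the table above could be collected in a single configuration object. The following is only a sketch: the class name `ModelConfig` and its field names are hypothetical, and the values are copied from the table rather than from any released code.

```python
from dataclasses import dataclass, asdict
from typing import Tuple


@dataclass(frozen=True)
class ModelConfig:
    """Hyperparameter values as listed in the table (field names are assumed)."""
    num_conv_layers: int = 3
    filters_per_layer: int = 250
    filter_size: Tuple[int, int] = (3, 3)
    dropout_rate: float = 0.5
    batch_size: int = 128
    embedding_dim: int = 300
    hidden_activation: str = "relu"
    output_activation: str = "softmax"
    optimizer: str = "adagrad"
    learning_rate: float = 0.001


config = ModelConfig()
config_dict = asdict(config)  # plain dict, convenient for logging a run
```

Freezing the dataclass prevents the settings from being changed by accident during an experiment.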
Experiment and discussion
Evaluation setup
Metrics | Formula |
---|---|
Precision | \(TP/(TP+FP)\) |
Recall | \(TP/(TP+FN)\) |
F1-Score | \(2 \times (\text{Precision} \times \text{Recall})/(\text{Precision} + \text{Recall})\) |
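The formulas in the table can be computed directly from per-class counts. The sketch below assumes the usual meanings of TP, FP, and FN (true positives, false positives, false negatives) for a single class; the function names and the example counts are illustrative, and returning 0.0 for an empty denominator is a convention chosen here, not one stated in the paper.

```python
def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); returns 0.0 when nothing was predicted as positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN); returns 0.0 when the class has no true examples."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts for one class: 3 true positives, 1 false positive,
# 3 false negatives.
p = precision(tp=3, fp=1)
r = recall(tp=3, fn=3)
f1 = f1_score(p, r)
```

For a four-class problem such as the one described here, these values would be computed once per class and then averaged, for example with a macro average.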