Abstract
The explosive growth of social media has greatly expanded freedom of speech online. A worldwide platform for human voices lets users attack others without facing consequences and flout social etiquette, resulting in an inevitable increase in hate speech. English hate speech detection is now a popular research area, but the prevalence of implicit hate content in regional languages calls for effective language-independent models. The proposed research is the first unsupervised Hindi and Bengali hate content detection framework, and it consists of three components: HateCircle, hate tweet classification, and code-switch data preparation algorithms. The novel HateCircle method detects the hate orientation of each term using co-occurrence patterns of words, contextual semantics, and emotion analysis. The multiclass hate tweet classification algorithm uses parts-of-speech tagging, Euclidean distance, and the Geometric median methods. Because hate content detection is more efficient in the native script than in the Roman script, a transliteration algorithm is also proposed for code-switch data preparation. The experiments evaluate combinations of various lexicons with our enriched hate lexicon, which achieve a maximum F1-score of 0.74 for the Hindi and 0.88 for the Bengali datasets. The HateCircle and hate tweet detection framework is evaluated with our proposed parts-of-speech tagging and Geometric median detection methods. Results show that the framework achieves a maximum accuracy of 0.73 for the Hindi and 0.78 for the Bengali dataset. The experimental results indicate that contextual semantic hate speech detection with language independence can help counter the growth of implicit abusive text on social media.
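The abstract names the Geometric median and Euclidean distance but does not say how they are combined. As a general illustration only, the sketch below approximates the geometric median of a set of 2-D points with Weiszfeld's iteration, a standard technique that may differ from the authors' procedure. The function name `geometric_median`, the 2-D point representation, and the tolerance values are assumptions made for this example.

```python
import math

def geometric_median(points, tol=1e-9, max_iter=1000, eps=1e-12):
    """Approximate the geometric median of 2-D points (Weiszfeld's iteration).

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(points)
    # Start from the centroid (arithmetic mean of the coordinates).
    x = sum(p[0] for p in points) / n
    y = sum(p[1] for p in points) / n
    for _ in range(max_iter):
        num_x = num_y = denom = 0.0
        for px, py in points:
            # Euclidean distance from the current estimate to this point,
            # clamped to eps to avoid division by zero (a simplification).
            d = max(math.hypot(px - x, py - y), eps)
            w = 1.0 / d
            num_x += w * px
            num_y += w * py
            denom += w
        new_x, new_y = num_x / denom, num_y / denom
        moved = math.hypot(new_x - x, new_y - y)
        x, y = new_x, new_y
        if moved < tol:
            break
    return x, y
```

The geometric median minimizes the sum of Euclidean distances to the points, which makes it less sensitive to outliers than the centroid. For four points placed symmetrically around the origin, the function returns a point at (or numerically very close to) the origin.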
Index Terms
- HateCircle and Unsupervised Hate Speech Detection Incorporating Emotion and Contextual Semantics