Published in: Social Network Analysis and Mining 1/2023

1 December 2023 | Original Article

A comparison of text preprocessing techniques for hate and offensive speech detection in Twitter

Author: Anna Glazkova


Abstract

Preprocessing is a crucial step in any text classification task. It can significantly affect classification performance, yet few large-scale studies have evaluated the effectiveness of preprocessing techniques and their combinations. In this work, we explore the impact of 26 widely used text preprocessing techniques on the performance of hate and offensive speech detection algorithms. We evaluate six common machine learning models, namely logistic regression, random forest, linear support vector classifier, convolutional neural network, bidirectional encoder representations from transformers (BERT), and RoBERTa, on four common Twitter benchmarks. Our results show that some preprocessing techniques improve model accuracy, while others may even degrade performance. In addition, the effectiveness of preprocessing techniques varies with the dataset and the classification method. We also explore two ways of combining the techniques that proved effective when evaluated separately. The outcome of combining techniques is not uniform: in our experiments, it works better for traditional machine learning methods than for other methods.
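To make the idea of applying preprocessing techniques separately and in combination more concrete, here is a minimal Python sketch. The abstract does not list the 26 techniques, so the four steps below (lowercasing, URL removal, mention replacement, hashtag symbol removal) are illustrative assumptions typical of tweet cleaning, not the paper's method. The function names, the `combine` helper, and the sample tweet are hypothetical.

```python
import re
from typing import Callable, Iterable

# A preprocessing technique is modelled as a function from text to text.
Preprocessor = Callable[[str], str]


def lowercase(text: str) -> str:
    """Convert the text to lower case."""
    return text.lower()


def remove_urls(text: str) -> str:
    """Delete http(s) links."""
    return re.sub(r"https?://\S+", "", text)


def replace_mentions(text: str) -> str:
    """Replace every @username with a generic @user placeholder."""
    return re.sub(r"@\w+", "@user", text)


def strip_hashtag_symbol(text: str) -> str:
    """Keep the hashtag word but drop the leading '#'."""
    return re.sub(r"#(\w+)", r"\1", text)


def combine(steps: Iterable[Preprocessor]) -> Preprocessor:
    """Chain techniques in the given order (hypothetical helper).

    Whitespace is collapsed at the end so that removed tokens
    do not leave double spaces behind.
    """
    steps = list(steps)

    def pipeline(text: str) -> str:
        for step in steps:
            text = step(text)
        return " ".join(text.split())

    return pipeline


# Hypothetical example tweet, not taken from any benchmark.
tweet = "@Anna Check THIS out: https://example.com #NoHate"

# One technique evaluated on its own.
single = combine([remove_urls])

# Several techniques combined into one pipeline.
combined = combine(
    [lowercase, remove_urls, replace_mentions, strip_hashtag_symbol]
)

print(single(tweet))
print(combined(tweet))
```

In a comparison of this kind, each pipeline would be applied to the training and test texts before fitting a classifier, and the resulting scores compared against an unpreprocessed baseline. The order of steps is a design choice that can change the output when steps interact.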

Literatur
Zurück zum Zitat Alam S, Yao N (2019) The impact of preprocessing steps on the accuracy of machine learning algorithms in sentiment analysis. Comput Math Org Theory 25:319–335CrossRef Alam S, Yao N (2019) The impact of preprocessing steps on the accuracy of machine learning algorithms in sentiment analysis. Comput Math Org Theory 25:319–335CrossRef
Zurück zum Zitat Alfina I, Mulia R, Fanany MI, Ekanata Y (2017) Hate speech detection in the Indonesian language: a dataset and preliminary study. In: 2017 international conference on advanced computer science and information systems (ICACSIS). IEEE, pp 233–238 Alfina I, Mulia R, Fanany MI, Ekanata Y (2017) Hate speech detection in the Indonesian language: a dataset and preliminary study. In: 2017 international conference on advanced computer science and information systems (ICACSIS). IEEE, pp 233–238
Zurück zum Zitat Alonso P, Saini R, Kovacs G (2020) TheNorth at SemEval-2020 task 12: hate speech detection using Roberta. In: Proceedings of the fourteenth workshop on semantic evaluation, pp 2197–2202 Alonso P, Saini R, Kovacs G (2020) TheNorth at SemEval-2020 task 12: hate speech detection using Roberta. In: Proceedings of the fourteenth workshop on semantic evaluation, pp 2197–2202
Zurück zum Zitat Alrehili A (2019) Automatic hate speech detection on social media: a brief survey. In: 2019 IEEE/ACS 16th international conference on computer systems and applications (AICCSA). IEEE, pp 1–6 Alrehili A (2019) Automatic hate speech detection on social media: a brief survey. In: 2019 IEEE/ACS 16th international conference on computer systems and applications (AICCSA). IEEE, pp 1–6
Zurück zum Zitat Alshalan R, Al-Khalifa H (2020) A deep learning approach for automatic hate speech detection in the Saudi Twittersphere. Appl Sci 10(23):8614CrossRef Alshalan R, Al-Khalifa H (2020) A deep learning approach for automatic hate speech detection in the Saudi Twittersphere. Appl Sci 10(23):8614CrossRef
Zurück zum Zitat Ameer I, Siddiqui MHF, Sidorov G, Gelbukh A (2019) CIC at SemEval-2019 task 5: simple yet very efficient approach to hate speech detection, aggressive behavior detection, and target classification in Twitter. In: Proceedings of the 13th international workshop on semantic evaluation, pp 382–386 Ameer I, Siddiqui MHF, Sidorov G, Gelbukh A (2019) CIC at SemEval-2019 task 5: simple yet very efficient approach to hate speech detection, aggressive behavior detection, and target classification in Twitter. In: Proceedings of the 13th international workshop on semantic evaluation, pp 382–386
Zurück zum Zitat Angiani G, Ferrari L, Fontanini T, Fornacciari P, Iotti E, Magliani F, Manicardi S (2016) A comparison between preprocessing techniques for sentiment analysis in twitter. In: KDWeb Angiani G, Ferrari L, Fontanini T, Fornacciari P, Iotti E, Magliani F, Manicardi S (2016) A comparison between preprocessing techniques for sentiment analysis in twitter. In: KDWeb
Zurück zum Zitat Ashraf N, Rafiq A, Butt S, Shehzad HMF, Sidorov G, Gelbukh AF (2022) Youtube based religious hate speech and extremism detection dataset with machine learning baselines. J Intell Fuzzy Syst 42:4769–4777CrossRef Ashraf N, Rafiq A, Butt S, Shehzad HMF, Sidorov G, Gelbukh AF (2022) Youtube based religious hate speech and extremism detection dataset with machine learning baselines. J Intell Fuzzy Syst 42:4769–4777CrossRef
Zurück zum Zitat Badjatiya P, Gupta S, Gupta M, Varma V (2017) Deep learning for hate speech detection in tweets. In: Proceedings of the 26th international conference on world wide web companion, pp 759–760 Badjatiya P, Gupta S, Gupta M, Varma V (2017) Deep learning for hate speech detection in tweets. In: Proceedings of the 26th international conference on world wide web companion, pp 759–760
Zurück zum Zitat Bai Q, Dan Q, Mu Z, Yang M (2019) A systematic review of emoji: current research and future perspectives. Front Psychol 10:2221CrossRef Bai Q, Dan Q, Mu Z, Yang M (2019) A systematic review of emoji: current research and future perspectives. Front Psychol 10:2221CrossRef
Zurück zum Zitat Balouchzahi F, Shashirekha H (2020) Las for hasoc-learning approaches for hate speech and offensive content identification. In: FIRE (working notes), pp 145–151 Balouchzahi F, Shashirekha H (2020) Las for hasoc-learning approaches for hate speech and offensive content identification. In: FIRE (working notes), pp 145–151
Zurück zum Zitat Banerjee S, Sarkar M, Agrawal N, Saha P, Das M (2021) Exploring transformer based models to identify hate speech and offensive content in English and Indo-Aryan languages. arXiv preprint arXiv:2111.13974 Banerjee S, Sarkar M, Agrawal N, Saha P, Das M (2021) Exploring transformer based models to identify hate speech and offensive content in English and Indo-Aryan languages. arXiv preprint arXiv:​2111.​13974
Zurück zum Zitat Baruah A, Barbhuiya F, Dey K (2019) ABARUAH at SemEval-2019 task 5: bi-directional LSTM for hate speech detection. In: Proceedings of the 13th international workshop on semantic evaluation, pp 371–376 Baruah A, Barbhuiya F, Dey K (2019) ABARUAH at SemEval-2019 task 5: bi-directional LSTM for hate speech detection. In: Proceedings of the 13th international workshop on semantic evaluation, pp 371–376
Zurück zum Zitat Basile V, Bosco C, Fersini E, Debora N, Patti V, Pardo FMR, Rosso P, Sanguinetti M (2019) SemEval-2019 task 5: multilingual detection of hate speech against immigrants and women in Twitter. In: 13th international workshop on semantic evaluation. Association for Computational Linguistics, pp 54–63 Basile V, Bosco C, Fersini E, Debora N, Patti V, Pardo FMR, Rosso P, Sanguinetti M (2019) SemEval-2019 task 5: multilingual detection of hate speech against immigrants and women in Twitter. In: 13th international workshop on semantic evaluation. Association for Computational Linguistics, pp 54–63
Zurück zum Zitat Bhandari A, Shah SB, Thapa S, Naseem U, Nasim M (2023) CrisisHateMM: multimodal analysis of directed and undirected hate speech in text-embedded images from Russia-Ukraine conflict. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) workshops, pp 1993–2002 Bhandari A, Shah SB, Thapa S, Naseem U, Nasim M (2023) CrisisHateMM: multimodal analysis of directed and undirected hate speech in text-embedded images from Russia-Ukraine conflict. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) workshops, pp 1993–2002
Zurück zum Zitat Bird S (2006) NLTK: the natural language toolkit. In: Proceedings of the COLING/ACL 2006 interactive presentation sessions, pp 69–72 Bird S (2006) NLTK: the natural language toolkit. In: Proceedings of the COLING/ACL 2006 interactive presentation sessions, pp 69–72
Zurück zum Zitat Bölücü N, Canbay P (2021) Hate speech and offensive content identification with graph convolutional networks. In: Forum for information retrieval evaluation (working notes)(FIRE), CEUR-WS.org, pp 44–51 Bölücü N, Canbay P (2021) Hate speech and offensive content identification with graph convolutional networks. In: Forum for information retrieval evaluation (working notes)(FIRE), CEUR-WS.org, pp 44–51
Zurück zum Zitat Caselli T, Basile V, Mitrović J, Granitzer M (2021) HateBERT: retraining BERT for abusive language detection in English. In: Proceedings of the 5th workshop on online abuse and harms (WOAH 2021), pp 17–25 Caselli T, Basile V, Mitrović J, Granitzer M (2021) HateBERT: retraining BERT for abusive language detection in English. In: Proceedings of the 5th workshop on online abuse and harms (WOAH 2021), pp 17–25
Zurück zum Zitat Caselli T, Basile V, Mitrović J, Kartoziya I, Granitzer M (2020) I feel offended, don’t be abusive! implicit/explicit messages in offensive and abusive language. In: Proceedings of the 12th language resources and evaluation conference, pp 6193–6202 Caselli T, Basile V, Mitrović J, Kartoziya I, Granitzer M (2020) I feel offended, don’t be abusive! implicit/explicit messages in offensive and abusive language. In: Proceedings of the 12th language resources and evaluation conference, pp 6193–6202
Zurück zum Zitat Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, Grave É, Ott M, Zettlemoyer L, Stoyanov V (2020) Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 8440–8451 Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, Grave É, Ott M, Zettlemoyer L, Stoyanov V (2020) Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 8440–8451
Zurück zum Zitat Das AK, Al Asif A, Paul A, Hossain MN (2021) Bangla hate speech detection on social media using attention-based recurrent neural network. J Intell Syst 30(1):578–591 Das AK, Al Asif A, Paul A, Hossain MN (2021) Bangla hate speech detection on social media using attention-based recurrent neural network. J Intell Syst 30(1):578–591
Zurück zum Zitat Davidson T, Bhattacharya D, Weber I (2019) Racial bias in hate speech and abusive language detection datasets. In: Proceedings of the third workshop on abusive language online, pp 25–35 Davidson T, Bhattacharya D, Weber I (2019) Racial bias in hate speech and abusive language detection datasets. In: Proceedings of the third workshop on abusive language online, pp 25–35
Zurück zum Zitat Davidson T, Warmsley D, Macy M, Weber I (2017) Automated hate speech detection and the problem of offensive language. In: Proceedings of the international AAAI conference on web and social media, vol 11, pp 512–515 Davidson T, Warmsley D, Macy M, Weber I (2017) Automated hate speech detection and the problem of offensive language. In: Proceedings of the international AAAI conference on web and social media, vol 11, pp 512–515
Zurück zum Zitat Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the Association for Computational Linguistics: human language technologies, volume 1 (long and short papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp 4171–4186 Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the Association for Computational Linguistics: human language technologies, volume 1 (long and short papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp 4171–4186
Zurück zum Zitat Do HT-T, Huynh HD, Van Nguyen K, Nguyen NL-T, Nguyen AG-T (2019) Hate speech detection on Vietnamese social media text using the bidirectional-LSTM model. In: The sixth international workshop on Vietnamese language and speech processing VLSP 2019 Do HT-T, Huynh HD, Van Nguyen K, Nguyen NL-T, Nguyen AG-T (2019) Hate speech detection on Vietnamese social media text using the bidirectional-LSTM model. In: The sixth international workshop on Vietnamese language and speech processing VLSP 2019
Zurück zum Zitat Dogru HB, Tilki S, Jamil A, Hameed AA (2021) Deep learning-based classification of news texts using Doc2vec model. In: 2021 1st international conference on artificial intelligence and data analytics (CAIDA). IEEE, pp 91–96 Dogru HB, Tilki S, Jamil A, Hameed AA (2021) Deep learning-based classification of news texts using Doc2vec model. In: 2021 1st international conference on artificial intelligence and data analytics (CAIDA). IEEE, pp 91–96
Zurück zum Zitat Fersini E, Nozza D, Rosso P (2018) Overview of the Evalita 2018 task on automatic misogyny identification (AMI). In: CEUR workshop proceedings. CEUR-WS, vol 2263, pp 1–9 Fersini E, Nozza D, Rosso P (2018) Overview of the Evalita 2018 task on automatic misogyny identification (AMI). In: CEUR workshop proceedings. CEUR-WS, vol 2263, pp 1–9
Zurück zum Zitat Fersini E, Rosso P, Anzovino M (2018) Overview of the task on automatic misogyny identification at IberEval 2018. In: CEUR workshop proceedings. CEUR-WS, vol 2150, pp 214–228 Fersini E, Rosso P, Anzovino M (2018) Overview of the task on automatic misogyny identification at IberEval 2018. In: CEUR workshop proceedings. CEUR-WS, vol 2150, pp 214–228
Zurück zum Zitat Fortuna P, Nunes S (2018) A survey on automatic detection of hate speech in text. ACM Comput Surv CSUR 51(4):1–30 Fortuna P, Nunes S (2018) A survey on automatic detection of hate speech in text. ACM Comput Surv CSUR 51(4):1–30
Zurück zum Zitat Fromknecht J, Palmer A (2020) UNT linguistics at SemEval-2020 task 12: linear SVC with pre-trained word embeddings as document vectors and targeted linguistic features. In: Proceedings of the fourteenth workshop on semantic evaluation, pp 2209–2215 Fromknecht J, Palmer A (2020) UNT linguistics at SemEval-2020 task 12: linear SVC with pre-trained word embeddings as document vectors and targeted linguistic features. In: Proceedings of the fourteenth workshop on semantic evaluation, pp 2209–2215
Zurück zum Zitat Garain A, Basu A (2019) The titans at SemEval-2019 task 5: detection of hate speech against immigrants and women in Twitter. In: Proceedings of the 13th international workshop on semantic evaluation, pp 494–497 Garain A, Basu A (2019) The titans at SemEval-2019 task 5: detection of hate speech against immigrants and women in Twitter. In: Proceedings of the 13th international workshop on semantic evaluation, pp 494–497
Zurück zum Zitat Garouani M, Chrita H, Kharroubi J (2021) Sentiment analysis of Moroccan tweets using text mining. In: Digital technologies and applications: proceedings of ICDTA 21, Fez, Morocco. Springer, pp 597–608 Garouani M, Chrita H, Kharroubi J (2021) Sentiment analysis of Moroccan tweets using text mining. In: Digital technologies and applications: proceedings of ICDTA 21, Fez, Morocco. Springer, pp 597–608
Zurück zum Zitat Glazkova A, Kadantsev M, Glazkov M (2021) Fine-tuning of pre-trained transformers for hate, offensive, and profane content detection in English and Marathi. In: FIRE 2021 working notes, pp 52–62 Glazkova A, Kadantsev M, Glazkov M (2021) Fine-tuning of pre-trained transformers for hate, offensive, and profane content detection in English and Marathi. In: FIRE 2021 working notes, pp 52–62
Zurück zum Zitat Guibon G, Ochs M, Bellot P (2016) From emojis to sentiment analysis. In: WACAI 2016 Guibon G, Ochs M, Bellot P (2016) From emojis to sentiment analysis. In: WACAI 2016
Zurück zum Zitat Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780CrossRef Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780CrossRef
Zurück zum Zitat Huang X, Xing L, Dernoncourt F, Paul MJ (2020) Multilingual twitter corpus and baselines for evaluating demographic bias in hate speech recognition. In: LREC Huang X, Xing L, Dernoncourt F, Paul MJ (2020) Multilingual twitter corpus and baselines for evaluating demographic bias in hate speech recognition. In: LREC
Zurück zum Zitat Hu R, Dorris W, Vishwamitra N, Luo F, Costello M (2020) On the impact of word representation in hate speech and offensive language detection and explanation. In: Proceedings of the tenth ACM conference on data and application security and privacy, pp 171–173 Hu R, Dorris W, Vishwamitra N, Luo F, Costello M (2020) On the impact of word representation in hate speech and offensive language detection and explanation. In: Proceedings of the tenth ACM conference on data and application security and privacy, pp 171–173
Zurück zum Zitat Jianqiang Z, Xiaolin G (2017) Comparison research on text pre-processing methods on twitter sentiment analysis. IEEE Access 5:2870–2879CrossRef Jianqiang Z, Xiaolin G (2017) Comparison research on text pre-processing methods on twitter sentiment analysis. IEEE Access 5:2870–2879CrossRef
Zurück zum Zitat Joulin A, Grave E, Bojanowski P, Mikolov T (2016) Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 Joulin A, Grave E, Bojanowski P, Mikolov T (2016) Bag of tricks for efficient text classification. arXiv preprint arXiv:​1607.​01759
Zurück zum Zitat Kadhim AI (2018) An evaluation of preprocessing techniques for text classification. Int J Comput Sci Inf Secur IJCSIS 16(6):22–32 Kadhim AI (2018) An evaluation of preprocessing techniques for text classification. Int J Comput Sci Inf Secur IJCSIS 16(6):22–32
Zurück zum Zitat Kaibi I, Satori H (2019) A comparative evaluation of word embeddings techniques for twitter sentiment analysis. In: 2019 international conference on wireless technologies, embedded and intelligent systems (WITS). IEEE, pp 1–4 Kaibi I, Satori H (2019) A comparative evaluation of word embeddings techniques for twitter sentiment analysis. In: 2019 international conference on wireless technologies, embedded and intelligent systems (WITS). IEEE, pp 1–4
Zurück zum Zitat Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1746–1751 Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1746–1751
Zurück zum Zitat Kirk H, Yin W, Vidgen B, Röttger P (2023) SemEval-2023 task 10: explainable detection of online sexism. In: Proceedings of the 17th international workshop on semantic evaluation (SemEval-2023). Association for Computational Linguistics, Toronto, Canada, pp 2193–2210. https://aclanthology.org/2023.semeval-1.305 Kirk H, Yin W, Vidgen B, Röttger P (2023) SemEval-2023 task 10: explainable detection of online sexism. In: Proceedings of the 17th international workshop on semantic evaluation (SemEval-2023). Association for Computational Linguistics, Toronto, Canada, pp 2193–2210. https://​aclanthology.​org/​2023.​semeval-1.​305
Zurück zum Zitat Kodali P, Bhatnagar A, Ahuja N, Shrivastava M, Kumaraguru P (2022) HashSet—a dataset for hashtag segmentation. arXiv preprint arXiv:2201.06741 Kodali P, Bhatnagar A, Ahuja N, Shrivastava M, Kumaraguru P (2022) HashSet—a dataset for hashtag segmentation. arXiv preprint arXiv:​2201.​06741
Zurück zum Zitat Krouska A, Troussas C, Virvou M (2016) The effect of preprocessing techniques on twitter sentiment analysis. In: 2016 7th international conference on information, intelligence, systems & applications (IISA). IEEE, pp 1–5 Krouska A, Troussas C, Virvou M (2016) The effect of preprocessing techniques on twitter sentiment analysis. In: 2016 7th international conference on information, intelligence, systems & applications (IISA). IEEE, pp 1–5
Zurück zum Zitat Liao W, Zeng B, Liu J, Wei P, Cheng X, Zhang W (2021) Multi-level graph neural network for text sentiment analysis. Comput Electr Eng 92:107096CrossRef Liao W, Zeng B, Liu J, Wei P, Cheng X, Zhang W (2021) Multi-level graph neural network for text sentiment analysis. Comput Electr Eng 92:107096CrossRef
Zurück zum Zitat Li M, Liao S, Okpala E, Tong M, Costello M, Cheng L, Hu H, Luo F (2021) COVID-hateBERT: a pre-trained language model for COVID-19 related hate speech detection. In: 2021 20th IEEE international conference on machine learning and applications (ICMLA), pp 233–238. IEEE Li M, Liao S, Okpala E, Tong M, Costello M, Cheng L, Hu H, Luo F (2021) COVID-hateBERT: a pre-trained language model for COVID-19 related hate speech detection. In: 2021 20th IEEE international conference on machine learning and applications (ICMLA), pp 233–238. IEEE
Zurück zum Zitat Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:​1907.​11692
Zurück zum Zitat Loshchilov I, Hutter F (2018) Decoupled weight decay regularization. In: International conference on learning representations Loshchilov I, Hutter F (2018) Decoupled weight decay regularization. In: International conference on learning representations
Zurück zum Zitat Luu ST, Nguyen HP, Van Nguyen K, Nguyen NL-T (2020) Comparison between traditional machine learning models and neural network models for Vietnamese hate speech detection. In: 2020 RIVF international conference on computing and communication technologies (RIVF). IEEE, pp 1–6 Luu ST, Nguyen HP, Van Nguyen K, Nguyen NL-T (2020) Comparison between traditional machine learning models and neural network models for Vietnamese hate speech detection. In: 2020 RIVF international conference on computing and communication technologies (RIVF). IEEE, pp 1–6
Zurück zum Zitat MacAvaney S, Yao H-R, Yang E, Russell K, Goharian N, Frieder O (2019) Hate speech detection: challenges and solutions. PLoS ONE 14(8):0221152CrossRef MacAvaney S, Yao H-R, Yang E, Russell K, Goharian N, Frieder O (2019) Hate speech detection: challenges and solutions. PLoS ONE 14(8):0221152CrossRef
Zurück zum Zitat Mandl T, Modha S, Kumar M A, Chakravarthi BR (2020) Overview of the HASOC track at fire 2020: hate speech and offensive language identification in Tamil, Malayalam, Hindi, English and German. In: Forum for information retrieval evaluation, pp 29–32 Mandl T, Modha S, Kumar M A, Chakravarthi BR (2020) Overview of the HASOC track at fire 2020: hate speech and offensive language identification in Tamil, Malayalam, Hindi, English and German. In: Forum for information retrieval evaluation, pp 29–32
Zurück zum Zitat Mandl T, Modha S, Majumder P, Patel D, Dave M, Mandlia C, Patel A (2019) Overview of the HASOC track at fire 2019: hate speech and offensive content identification in Indo-European languages. In: Proceedings of the 11th forum for information retrieval evaluation, pp 14–17 Mandl T, Modha S, Majumder P, Patel D, Dave M, Mandlia C, Patel A (2019) Overview of the HASOC track at fire 2019: hate speech and offensive content identification in Indo-European languages. In: Proceedings of the 11th forum for information retrieval evaluation, pp 14–17
Zurück zum Zitat Menini S, Aprosio AP, Tonelli S (2021) Abuse is contextual, what about NLP? The role of context in abusive language annotation and detection. arXiv preprint arXiv:2103.14916 Menini S, Aprosio AP, Tonelli S (2021) Abuse is contextual, what about NLP? The role of context in abusive language annotation and detection. arXiv preprint arXiv:​2103.​14916
Zurück zum Zitat Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:​1301.​3781
Zurück zum Zitat Mishra AK, Saumya S, Kumar A (2020) Iiit_dwd@hasoc 2020: identifying offensive content in Indo-European languages. In: FIRE (working notes), pp 139–144 Mishra AK, Saumya S, Kumar A (2020) Iiit_dwd@hasoc 2020: identifying offensive content in Indo-European languages. In: FIRE (working notes), pp 139–144
Zurück zum Zitat Modha S, Mandl T, Majumder P, Satapara S, Patel T, Madhu H (2022) Overview of the HASOC subtrack at fire 2022: identification of conversational hate-speech in Hindi-English code-mixed and German language. Working notes of FIRE Modha S, Mandl T, Majumder P, Satapara S, Patel T, Madhu H (2022) Overview of the HASOC subtrack at fire 2022: identification of conversational hate-speech in Hindi-English code-mixed and German language. Working notes of FIRE
Zurück zum Zitat Modha S, Mandl T, Shahi GK, Madhu H, Satapara S, Ranasinghe T, Zampieri M (2021) Overview of the HASOC subtrack at fire 2021: hate speech and offensive content identification in English and Indo-Aryan languages and conversational hate speech. In: Forum for information retrieval evaluation, pp 1–3 Modha S, Mandl T, Shahi GK, Madhu H, Satapara S, Ranasinghe T, Zampieri M (2021) Overview of the HASOC subtrack at fire 2021: hate speech and offensive content identification in English and Indo-Aryan languages and conversational hate speech. In: Forum for information retrieval evaluation, pp 1–3
Zurück zum Zitat Mohammad F (2018) Is preprocessing of text really worth your time for toxic comment classification? In: Proceedings on the international conference on artificial intelligence (ICAI). The Steering Committee of The World Congress in Computer Science, Computer, pp 447–453 Mohammad F (2018) Is preprocessing of text really worth your time for toxic comment classification? In: Proceedings on the international conference on artificial intelligence (ICAI). The Steering Committee of The World Congress in Computer Science, Computer, pp 447–453
Zurück zum Zitat Montejo-Ráez A, Jiménez-Zafra SM, Garcia-Cumbreras MA, Díaz-Galiano MC (2019) SINAI-DL at SemEval-2019 task 5: recurrent networks and data augmentation by paraphrasing. In: Proceedings of the 13th international workshop on semantic evaluation, pp 480–483 Montejo-Ráez A, Jiménez-Zafra SM, Garcia-Cumbreras MA, Díaz-Galiano MC (2019) SINAI-DL at SemEval-2019 task 5: recurrent networks and data augmentation by paraphrasing. In: Proceedings of the 13th international workshop on semantic evaluation, pp 480–483
Zurück zum Zitat Naseem U, Razzak I, Hameed IA (2019) Deep context-aware embedding for abusive and hate speech detection on twitter. Aust J Intell Inf Process Syst 15(3):69–76 Naseem U, Razzak I, Hameed IA (2019) Deep context-aware embedding for abusive and hate speech detection on twitter. Aust J Intell Inf Process Syst 15(3):69–76
Zurück zum Zitat Naseem U, Razzak I, Eklund PW (2021) A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on twitter. Multimedia Tools Appl 80(28):35239–35266CrossRef Naseem U, Razzak I, Eklund PW (2021) A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on twitter. Multimedia Tools Appl 80(28):35239–35266CrossRef
Zurück zum Zitat Nobata C, Tetreault J, Thomas A, Mehdad Y, Chang Y (2016) Abusive language detection in online user content. In: Proceedings of the 25th international conference on world wide web, pp 145–153 Nobata C, Tetreault J, Thomas A, Mehdad Y, Chang Y (2016) Abusive language detection in online user content. In: Proceedings of the 25th international conference on world wide web, pp 145–153
Zurück zum Zitat Nugroho K, Noersasongko E, Fanani AZ, Basuki RS (2019) Improving random forest method to detect hatespeech and offensive word. In: 2019 international conference on information and communications technology (ICOIACT). IEEE, pp 514–518 Nugroho K, Noersasongko E, Fanani AZ, Basuki RS (2019) Improving random forest method to detect hatespeech and offensive word. In: 2019 international conference on information and communications technology (ICOIACT). IEEE, pp 514–518
Zurück zum Zitat Oliveira DN, Merschmann LHDC (2021) Joint evaluation of preprocessing tasks with classifiers for sentiment analysis in Brazilian Portuguese language. Multimedia Tools Appl 80:15391–15412CrossRef Oliveira DN, Merschmann LHDC (2021) Joint evaluation of preprocessing tasks with classifiers for sentiment analysis in Brazilian Portuguese language. Multimedia Tools Appl 80:15391–15412CrossRef
Zurück zum Zitat Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) Pytorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32 Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) Pytorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32
Zurück zum Zitat Pavlopoulos J, Sorensen J, Laugier L, Androutsopoulos I (2021) SemEval-2021 task 5: toxic spans detection. In: Proceedings of the 15th international workshop on semantic evaluation (SemEval-2021), pp 59–69 Pavlopoulos J, Sorensen J, Laugier L, Androutsopoulos I (2021) SemEval-2021 task 5: toxic spans detection. In: Proceedings of the 15th international workshop on semantic evaluation (SemEval-2021), pp 59–69
Zurück zum Zitat Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830MathSciNet Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830MathSciNet
Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
Plaza-Del-Arco FM, Molina-González MD, Ureña-López LA, Martín-Valdivia MT (2021) A multi-task learning approach to hate speech detection leveraging sentiment analysis. IEEE Access 9:112478–112489
Poletto F, Basile V, Sanguinetti M, Bosco C, Patti V (2021) Resources and benchmark corpora for hate speech detection: a systematic review. Lang Resour Eval 55(2):477–523
Porter MF (2001) Snowball: a language for stemming algorithms
Ramachandran D, Parvathi R (2019) Analysis of twitter specific preprocessing technique for tweets. Procedia Comput Sci 165:245–251
Ranasinghe T, Hettiarachchi H (2020) Brums at SemEval-2020 task 12: transformer based multilingual offensive language identification in social media. In: Proceedings of the fourteenth workshop on semantic evaluation, pp 1906–1915
Renault T (2020) Sentiment analysis and machine learning in finance: a comparison of methods and models on one million messages. Digit Finance 2(1–2):1–13
Reuter J, Pereira-Martins J, Kalita J (2016) Segmenting Twitter hashtags. Int J Nat Lang Comput 5(4):23–36
Rogers A, Kovaleva O, Rumshisky A (2020) A primer in BERTology: what we know about how BERT works. Trans Assoc Comput Ling 8:842–866
Saeed AM, Ismael AN, Rasul DL, Majeed RS, Rashid TA (2022) Hate speech detection in social media for the Kurdish language. In: Proceedings of the ICR’22 international conference on innovations in computing research. Springer, pp 253–260
Saeed NM, Helal NA, Badr NL, Gharib TF (2018) The impact of spam reviews on feature-based sentiment analysis. In: 2018 13th international conference on computer engineering and systems (ICCES). IEEE, pp 633–639
Saeed NM, Helal NA, Badr NL, Gharib TF (2020) An enhanced feature-based sentiment analysis approach. Wiley Interdiscip Rev Data Min Knowl Discov 10(2):1347
Saeed RM, Rady S, Gharib TF (2021) Optimizing sentiment classification for Arabic opinion texts. Cogn Comput 13(1):164–178
Saeed RM, Rady S, Gharib TF (2022) An ensemble approach for spam detection in Arabic opinion texts. J King Saud Univ Comput Inf Sci 34(1):1407–1416
Schmidt A, Wiegand M (2019) A survey on hate speech detection using natural language processing. In: Proceedings of the fifth international workshop on natural language processing for social media, April 3, 2017, Valencia, Spain. Association for Computational Linguistics, pp 1–10
Silva SC, Ferreira TC, Ramos RMS, Paraboni I (2020) Data driven and psycholinguistics motivated approaches to hate speech detection. Computación y Sistemas 24
Štrimaitis R, Stefanovič P, Ramanauskaitė S, Slotkienė A (2021) Financial context news sentiment analysis for the Lithuanian language. Appl Sci 11(10):4443
Symeonidis S, Effrosynidis D, Arampatzis A (2018) A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis. Expert Syst Appl 110:298–310
Thapa S, Jafri FA, Hürriyetoğlu A, Vargas F, Lee RK-W, Naseem U (2023) Multimodal hate speech event detection - shared task 4, CASE 2023. In: Proceedings of the 6th workshop on challenges and applications of automated extraction of socio-political events from text (CASE)
Toraman C, Şahinuç F, Yılmaz EH (2022) Large-scale hate speech detection with cross-domain transfer. arXiv preprint arXiv:2203.01111
Wallace E, Wang Y, Li S, Singh S, Gardner M (2019) Do NLP models know numbers? Probing numeracy in embeddings. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp 5307–5315
Wang B, Ding Y, Liu S, Zhou X (2019) Ynu_wb at HASOC 2019: ordered neurons LSTM with attention for identifying hate speech and offensive language. In: FIRE (working notes), pp 191–198
Wang S, Liu J, Ouyang X, Sun Y (2020) Galileo at SemEval-2020 task 12: multi-lingual learning for offensive language identification using pre-trained language models. In: Proceedings of the fourteenth workshop on semantic evaluation, pp 1448–1455
Wang D, Liu P, Zheng Y, Qiu X, Huang X-J (2020) Heterogeneous graph neural networks for extractive document summarization. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 6209–6219
Wiedemann G, Yimam SM, Biemann C (2020) UHH-LT at SemEval-2020 task 12: fine-tuning of pre-trained transformer networks for offensive language detection. arXiv preprint arXiv:2004.11493
Wiegand M, Ruppenhofer J, Kleinbauer T (2019) Detection of abusive language: the problem of biased datasets. In: Proceedings of the 2019 conference of the North American chapter of the Association for Computational Linguistics: human language technologies, volume 1 (long and short papers), pp 602–608
Yin W, Zubiaga A (2021) Towards generalisable hate speech detection: a review on obstacles and solutions. PeerJ Comput Sci 7:598
Zampieri M, Malmasi S, Nakov P, Rosenthal S, Farra N, Kumar R (2019) SemEval-2019 task 6: identifying and categorizing offensive language in social media (offenseval). In: Proceedings of the 13th international workshop on semantic evaluation, pp 75–86
Zampieri M, Nakov P, Rosenthal S, Atanasova P, Karadzhov G, Mubarak H, Derczynski L, Pitenis Z, Çöltekin Ç (2020) SemEval-2020 task 12: multilingual offensive language identification in social media (offenseval 2020). In: Proceedings of the fourteenth workshop on semantic evaluation, pp 1425–1447
Zhang Z, Han X, Liu Z, Jiang X, Sun M, Liu Q (2019) ERNIE: enhanced language representation with informative entities. In: Proceedings of the 57th annual meeting of the association for computational linguistics, pp 1441–1451
Zhou Y, Yang Y, Liu H, Liu X, Savage N (2020) Deep learning based fusion approach for hate speech detection. IEEE Access 8:128923–128929
Zhou X, Yong Y, Fan X, Ren G, Song Y, Diao Y, Yang L, Lin H (2021) Hate speech detection based on sentiment knowledge sharing. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), pp 7158–7166
Metadata
Title
A comparison of text preprocessing techniques for hate and offensive speech detection in Twitter
Author
Anna Glazkova
Publication date
01.12.2023
Publisher
Springer Vienna
Published in
Social Network Analysis and Mining / Issue 1/2023
Print ISSN: 1869-5450
Electronic ISSN: 1869-5469
DOI
https://doi.org/10.1007/s13278-023-01156-y
