Published in: Social Network Analysis and Mining 1/2022

01.12.2022 | Original Article

Predicting the type and target of offensive social media posts in Marathi

Authors: Marcos Zampieri, Tharindu Ranasinghe, Mrinal Chaudhari, Saurabh Gaikwad, Prajwal Krishna, Mayuresh Nene, Shrunali Paygude

Abstract

Offensive language is widespread on social media, motivating platforms to invest in strategies to make communities safer. These strategies include developing robust machine learning systems capable of recognizing offensive content online. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English and a few other high-resource languages such as French, German, and Spanish. In this paper, we address this gap by tackling offensive language identification in Marathi, a low-resource Indo-Aryan language spoken in India. We introduce the Marathi Offensive Language Dataset v.2.0, or MOLD 2.0, and present multiple experiments on this dataset. MOLD 2.0 is a much larger version of MOLD with annotation expanded to levels B (type) and C (target) of the popular OLID taxonomy. MOLD 2.0 is the first hierarchical offensive language dataset compiled for Marathi, thus opening new avenues for research in low-resource Indo-Aryan languages. Finally, we also introduce SeMOLD, a larger dataset annotated following the semi-supervised methods presented in SOLID (Rosenthal et al. 2021).

Footnotes
2 Tweepy Python library documentation is available at https://www.tweepy.org/.
3 Marathi FastText embeddings are available at https://fasttext.cc/docs/en/crawl-vectors.html.
5 DeepOffense is available as a pip package at https://pypi.org/project/deepoffense/.

References
Alakrot A, Murray L, Nikolov NS (2018) Towards accurate detection of offensive language in online communication in Arabic. Procedia Comput Sci 142:315–320
Aroyehun ST, Gelbukh A (2018) Aggression detection in social media: using deep neural networks, data augmentation, and pseudo labeling. In: Proceedings of TRAC
Basile V, Bosco C, Fersini E, Nozza D, Patti V, Pardo FMR, Rosso P, Sanguinetti M (2019) SemEval-2019 task 5: multilingual detection of hate speech against immigrants and women in Twitter. In: Proceedings of SemEval
Bassignana E, Basile V, Patti V (2018) Hurtlex: a multilingual lexicon of words to hurt. In: Proceedings of CLiC-it
Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:1
Carletta J (1996) Assessing agreement on classification tasks: the kappa statistic. Comput Linguist 22(2):249–254
Chiril P, Benamara Zitoune F, Moriceau V, Coulomb-Gully M, Kumar A (2019) Multilingual and multitarget hate speech detection in tweets. In: Proceedings of TALN
Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, Grave E, Ott M, Zettlemoyer L, Stoyanov V (2019) Unsupervised cross-lingual representation learning at scale. In: Proceedings of ACL
Çöltekin C (2020) A corpus of Turkish offensive language on social media. In: Proceedings of LREC
Dadvar M, Trieschnigg D, Ordelman R, de Jong F (2013) Improving cyberbullying detection with user context. In: Proceedings of ECIR
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL
Fišer D, Erjavec T, Ljubešić N (2017) Legal framework, dataset and annotation schema for socially unacceptable on-line discourse practices in Slovene. In: Proceedings of ALW
Fortuna P, da Silva JR, Wanner L, Nunes S, et al (2019) A hierarchically-labeled Portuguese hate speech dataset. In: Proceedings of ALW
Gaikwad SS, Ranasinghe T, Zampieri M, Homan C (2021) Cross-lingual offensive language identification for low resource languages: the case of Marathi. In: Proceedings of RANLP
Ghadery E, Moens M-F (2020) LIIR at SemEval-2020 task 12: a cross-lingual augmentation approach for multilingual offensive language identification. In: Proceedings of SemEval
Goudjil M, Koudil M, Bedda M, Ghoggali N (2018) A novel active learning method using SVM for text classification. Int J Autom Comput 15(3):290–298
Hettiarachchi H, Ranasinghe T (2019) Emoji powered capsule network to detect type and target of offensive posts in social media. In: Proceedings of RANLP
Kakwani D, Kunchukuttan A, Golla S, NC G, Bhattacharyya A, Khapra MM, Kumar P (2020) IndicNLPSuite: monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In: Findings of the Association for Computational Linguistics: EMNLP 2020
Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of EMNLP
Kumar R, Ojha AK, Malmasi S, Zampieri M (2020) Evaluating aggression identification in social media. In: Proceedings of TRAC
Kumar R, Ojha AK, Malmasi S, Zampieri M (2018) Benchmarking aggression identification in social media. In: Proceedings of TRAC
Kumar S, Kumar S, Kanojia D, Bhattacharyya P (2020) A passage to India: pre-trained word embeddings for Indian languages. In: Proceedings of SLTU
Liu P, Li W, Zou L (2019) NULI at SemEval-2019 task 6: transfer learning for offensive language detection using bidirectional transformers. In: Proceedings of SemEval
Malmasi S, Zampieri M (2017) Detecting hate speech in social media. In: Proceedings of RANLP
Mandl T, Modha S, Majumder P, Patel D, Dave M, Mandlia C, Patel A (2019) Overview of the HASOC track at FIRE 2019: hate speech and offensive content identification in Indo-European languages. In: Proceedings of FIRE
Mandl T, Modha S, Kumar M A, Chakravarthi BR (2020) Overview of the HASOC track at FIRE 2020: hate speech and offensive language identification in Tamil, Malayalam, Hindi, English and German. In: Proceedings of FIRE
Modha S, Mandl T, Shahi GK, Madhu H, Satapara S, Ranasinghe T, Zampieri M (2021) Overview of the HASOC subtrack at FIRE 2021: hate speech and offensive content identification in English and Indo-Aryan languages and conversational hate speech. In: Proceedings of FIRE
Mubarak H, Rashed A, Darwish K, Samih Y, Abdelali A (2021) Arabic offensive language on Twitter: analysis and experiments. In: Proceedings of WANLP
Pamungkas EW, Patti V (2019) Cross-domain and cross-lingual abusive language detection: a hybrid approach with deep learning and a multilingual lexicon. In: Proceedings of ACL:SRW
Pitenis Z, Zampieri M, Ranasinghe T (2020) Offensive language identification in Greek. In: Proceedings of LREC
Poletto F, Stranisci M, Sanguinetti M, Patti V, Bosco C (2017) Hate speech annotation: analysis of an Italian Twitter corpus. In: Proceedings of CLiC-it
Ranasinghe T, Zampieri M (2021) An evaluation of multilingual offensive language identification methods for the languages of India. Information 12(8):1
Ranasinghe T, Zampieri M (2020) Multilingual offensive language identification with cross-lingual embeddings. In: Proceedings of EMNLP
Ranasinghe T, Zampieri M (2021) Multilingual offensive language identification for low-resource languages. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)
Ranasinghe T, Zampieri M (2021) MUDES: multilingual detection of offensive spans. In: Proceedings of NAACL
Ranasinghe T, Hettiarachchi H (2020) BRUMS at SemEval-2020 task 12: transformer based multilingual offensive language identification in social media. In: Proceedings of SemEval
Ranasinghe T, Sarkar D, Zampieri M, Ororbia A (2021) WLV-RIT at SemEval-2021 task 5: a neural transformer framework for detecting toxic spans. In: Proceedings of SemEval
Ridenhour M, Bagavathi A, Raisi E, Krishnan S (2020) Detecting online hate speech: approaches using weak supervision and network embedding models. arXiv preprint arXiv:2007.12724
Rosenthal S, Atanasova P, Karadzhov G, Zampieri M, Nakov P (2021) SOLID: a large-scale semi-supervised dataset for offensive language identification. In: Findings of ACL
Sarkar D, Zampieri M, Ranasinghe T, Ororbia A (2021) fBERT: a neural transformer for identifying offensive content. In: Findings of the Association for Computational Linguistics: EMNLP 2021, pp 1792–1798
Schwarm SE, Ostendorf M (2005) Reading level assessment using support vector machines and statistical language models. In: Proceedings of ACL
Tulkens S, Hilte L, Lodewyckx E, Verhoeven B, Daelemans W (2016) A dictionary-based approach to racism detection in Dutch social media. In: Proceedings of TA-COS
Wiegand M, Siegel M, Ruppenhofer J (2018) Overview of the GermEval 2018 shared task on the identification of offensive language. In: Proceedings of GermEval
Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Davison J, Shleifer S, von Platen P, Ma C, Jernite Y, Plu J, Xu C, Le Scao T, Gugger S, Drame M, Lhoest Q, Rush A (2020) Transformers: state-of-the-art natural language processing. In: Proceedings of EMNLP
Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV (2019) XLNet: generalized autoregressive pretraining for language understanding. In: Proceedings of NeurIPS
Yao M, Chelmis C, Zois D-S (2019) Cyberbullying ends here: towards robust detection of cyberbullying in social media. In: Proceedings of WWW
Zampieri M, Malmasi S, Nakov P, Rosenthal S, Farra N, Kumar R (2019) Predicting the type and target of offensive posts in social media. In: Proceedings of NAACL
Zampieri M, Nakov P, Rosenthal S, Atanasova P, Karadzhov G, Mubarak H, Derczynski L, Pitenis Z, Çöltekin C (2020) SemEval-2020 task 12: multilingual offensive language identification in social media (OffensEval 2020). In: Proceedings of SemEval
Zhang J, Chang J, Danescu-Niculescu-Mizil C, Dixon L, Hua Y, Taraborelli D, Thain N (2018) Conversations gone awry: detecting early signs of conversational failure. In: Proceedings of ACL
Metadata
Title
Predicting the type and target of offensive social media posts in Marathi
Authors
Marcos Zampieri
Tharindu Ranasinghe
Mrinal Chaudhari
Saurabh Gaikwad
Prajwal Krishna
Mayuresh Nene
Shrunali Paygude
Publication date
01.12.2022
Publisher
Springer Vienna
Published in
Social Network Analysis and Mining / Issue 1/2022
Print ISSN: 1869-5450
Electronic ISSN: 1869-5469
DOI
https://doi.org/10.1007/s13278-022-00906-8
