1 Introduction

Swearing is the use of taboo language (also referred to as bad language, swear words, offensive language, curse words, or vulgar words) to express the speaker’s emotional state to their listeners (Jay, 1992, 1999). Swearing is not limited to face-to-face conversation: it also occurs in online conversations across different languages, including on social media and online forums such as Twitter, which are typically characterized by informal language and spontaneous writing. Twitter is considered a particularly interesting data source for investigations related to swearing. According to Wang et al. (2014), the rate of swear word use in English Twitter is \(1.15\%\), almost double its rate in daily conversation (0.5–0.7%) as observed in previous work (Jay, 1992; Mehl & Pennebaker, 2003). Wang et al. (2014) also report that \(7.73\%\) of the tweets in their random sample contain swear words, which means that one tweet out of thirteen includes at least one swear word. Interestingly, they also observed that a list of only seven words covers about \(90\%\) of all swear word occurrences in their Twitter sample: f*ck, sh*t, *ss, b*tch, n*gga, h*ll, and wh*re.

Swearing in social media can be linked to an abusive context, when it is intended to offend, intimidate, or cause emotional or psychological harm, contributing to the expression of hatred in its various forms. In such contexts, indeed, swear words are often used to insult, as in cases of sexual harassment, hate speech, obscene telephone calls (OTCs), and verbal abuse (Jay et al., 2006; Jay & Janschewitz, 2008). However, swearing is a multifaceted phenomenon. The use of swear words does not always result in harm, and the harm depends on the context in which the swear word occurs (Jay, 2009a). Consider for instance the two following tweets containing swearing from the StackOverflow Offensive Comments dataset (Fišer et al., 2018):

If you don’t have the answer, move on to the next f*cking question and mind your own f*cking business

Sh_Khan: f*cking genius. Thank you

In the first example, the swear word is obviously used to insult, so this is an instance of abusive language. However, the second example shows the use of the same swear word in a casual setting, to emphasize an emotion of gratitude without any intention to offend (Pinker 2007, emphatic swearing).

Some studies have even found that the use of swear words has several upsides. Using swear words in communication with friends can promote advantageous social effects, including strengthening social bonds and improving conversational harmony, when swear words are used in ironic or sarcastic contexts (Jay, 2009a). Another study by Stephens and Umland (2011) found that swearing in cathartic ways can increase pain tolerance. Furthermore, Johnson (2012) has shown that the use of swear words can improve the effectiveness and persuasiveness of a message, especially when used to express an emotion of positive surprise. Accounts of appropriated uses of slurs should also not be neglected (Bianchi, 2014), that is, uses by targeted groups of their own slurs for non-derogatory purposes (e.g., the appropriation of ‘nigger’ by the African–American community, or the appropriation of ‘queer’ by the homosexual community).

In recent years, more and more studies have focused on abusive language detection, which covers hate speech, cyberbullying, trolling, and offensive language (Waseem et al., 2017; Schmidt & Wiegand, 2017; Michal et al., 2010). Swear words play an important role in these tasks, providing a signal to spot an offensive utterance (Malmasi & Zampieri, 2018). However, the presence of swear words can also lead to false positives when they occur in a casual context (Chen et al., 2012; Nobata et al., 2016; Van Hee et al., 2018; Malmasi & Zampieri, 2018). Distinguishing between abusive and not-abusive swearing contexts is therefore crucial to support and implement better content moderation practices. Indeed, on the one hand, there is considerable urgency for the most popular social media platforms, such as Twitter and Facebook, to develop robust approaches to abusive language detection, also to guarantee better compliance with government demands for counteracting the phenomenon (see, e.g., the recently issued EU Commission Code of Conduct on countering illegal hate speech online (EU Commission, 2016)). On the other hand, as reflected in statements from Twitter Safety and Security, users should be allowed to post potentially inflammatory content, as long as it is not abusive. The idea is that, as long as swear words are used but do not convey abuse/harassment, hateful conduct, sensitive content, and so on, they should not be censored.

In this work, we conduct an in-depth investigation of the role of swear words and their context in abusive language detection tasks. We explore the phenomenon of swearing in online conversation, taking the possibility of predicting the abusiveness of a swear word in a tweet context as the main investigation perspective. In this direction, the main goal is to automatically differentiate between abusive swearing, which should be regulated and countered in online communication, and not-abusive swearing, which should be allowed as part of freedom of speech, also recognizing its positive functions, as in the case of reclaimed uses of slurs. To achieve this objective, we make several contributions. First, we develop a new benchmark Twitter corpus, called SWAD (Swear Words Abusiveness Dataset), where abusive swearing is manually annotated at the word level. Based on several previous studies (Jay, 2009a; Dinakar et al., 2011; Golbeck et al., 2017), we define abusive swearing as the use of a swear word or profanity in cases such as name-calling, harassment, hate speech, and bullying, involving sensitive topics including physical appearance, sexuality, race and culture, and intelligence, with the intention of the author to insult or abuse a target (person or group). Other uses, such as reclaimed uses, catharsis, humor, or conversational uses, are considered not-abusive swearing. Second, we develop and experiment with supervised models to automatically predict abusive swearing within the tweet context. Such models are trained on the novel SWAD corpus to predict the abusiveness of a swear word within a tweet. Finally, we investigate the impact of swear word abusiveness on downstream abusive language detection tasks.

In this paper, we address the following research questions.

  • RQ1 How to model the swear word context in social media text as either abusive or not abusive? The abusiveness of a swear word strongly depends on its context. Therefore, we propose to explore the possibility of building a novel corpus that consists of tweets where swear words are annotated at the word level as either abusive or not abusive based on their use within their context.

  • RQ2 Is it possible to automatically predict the abusiveness of a swear word within the tweet context? To answer this question, we propose three different tasks, namely sequence labeling, text classification, and target-based swear word abusiveness prediction.

  • RQ3 Is the additional information about swear word abusiveness helpful for detecting abusive language? As part of the extrinsic evaluation of our corpus, we explore the impact of swear word context prediction in the downstream task of abusive language detection. We do so by infusing the swear word context prediction as an additional feature to the baseline models.

The contributions of this paper can be summarized as follows:

  1. We propose a novel corpus which focuses on studying the swear word context as either abusive or not abusive.

  2. We propose a new task to predict the abusiveness of swear words within the tweet context, taking inspiration from the target-based sentiment analysis task.

  3. We develop a novel architecture to predict the abusiveness of swear words within their context by adopting ideas from target-based sentiment analysis approaches.

  4. We leverage the swear word abusiveness feature to improve the baseline model in several downstream abusive language detection tasks.

This study is an extended version of our previous work on predicting abusive swearing in social media (Pamungkas et al., 2020a), extended with a more comprehensive literature review, a corpus extension, and additional experiments to gain better insight into the role of swear word context in abusive language detection tasks.

The paper is organized as follows. Section 2 introduces related work on swear word use and its context in online conversation; in addition, we review studies investigating the relation between swear word use and the abusive language detection task. Section 3 reports on the various steps of the development of the SWAD Twitter corpus. Section 4 presents the experimental setting for predicting the abusiveness of swear words and discusses the results. Then, Sect. 5 presents our experiments investigating the impact of the swear word abusiveness feature in several abusive language detection tasks. Finally, Sect. 6 includes conclusive remarks and ideas for future work.

2 Related work

2.1 Swearing in online content

Wang et al. (2014) examine cursing activity on the social media platform Twitter. They explore several research questions, including the ubiquity, utility, and contextual dependency of textual swearing on Twitter. On the same platform, Bak et al. (2012) found that swearing is used frequently between people who have a stronger social relationship, as part of their study on self-disclosure in Twitter conversation. Furthermore, Gauthier et al. (2015) provide an analysis of swearing on Twitter along several sociolinguistic dimensions, including age and gender. This study presents a deep exploration of the way British men and women use swear words. A gender- and age-based study of swearing was also conducted by Thelwall (2008), using the social network MySpace to build their corpus. Recently, Cachola et al. (2018) studied the use of vulgar words on Twitter, analyzing socio-cultural and pragmatic aspects of vulgarity based on users’ demographic data. Furthermore, they explored the impact of vulgar word use on the sentiment analysis task, finding that explicitly modeling vulgar words can boost sentiment analysis performance.

Besides social media, the study of swearing has also been carried out on online communities. The study by Sood et al. (2012) focused on the use of profanity in an online community called Yahoo! Buzz. They explored several research questions, including what the pitfalls of current profanity detection systems are, how profanity differs between communities, and how different communities receive swearing in various contexts. Recently, Rojas-Galeano (2017) aimed at tackling the difficulties in detecting obfuscated obscenities on Spanish and Portuguese online news sites. Kwon and Gruzd (2017) studied the contagious diffusion of offensive comments on Donald Trump’s campaign video on YouTube. They examined two kinds of swearing: public swearing (when swearing has no specific target) and interpersonal swearing (the use of taboo words with a specific target).

2.2 Contextual swearing

Swearing is not always abusive: its abusiveness is context-dependent. The swearing context has been explored by several prior studies. Fägersten (2012) classifies swearing contexts into two types, following the dichotomy introduced by Ross (1969): annoyance swearing, “occurring in situations of increased stress”, where the use of swear words appears to be “a manifestation of a release of tension”, and social swearing, “occurring in situations of low stress and intended as a solidarity builder”, which relates to the use of swear words in socially relaxed settings. Likewise, Allan and Burridge (2006) distinguish swearing contexts into casual contexts (when swear words do not cause insult, but are rather cathartic and humorous) and abusive contexts (when swear words are used with an intention to attack or insult).

Jay (2009b) found that the offensiveness of taboo words strongly depends on their context, and distinguishes the use of taboo words in conversational contexts (less offensive) from hostile contexts (very offensive). These findings support prior work by Rieber et al. (1979), who showed that obscenities and swear words used in a denotative way are far more offensive than those used in a connotative way. Furthermore, Pinker (2007) classified the use of swear words into five categories based on why people swear: dysphemistic, the exact opposite of euphemistic swearing; abusive, using taboo words to abuse or insult someone; idiomatic, using taboo words to arouse the interest of listeners without really referring to the matter; emphatic, to emphasize another word; and cathartic, the use of swear words as a response to stress or pain.

2.3 The role of swearing in abusive content

In recent years, abusive language detection has gained interest from the research community. Swear words play a key role in this task, according to several works in the literature. Razavi et al. (2010) developed an automatic system for discriminating between regular texts and flames. They built a dictionary for this specific purpose called the Insulting and Abusing Language Dictionary (IALD), which contains words, phrases, and expressions with several degrees of abuse and insult. Several swear words can be found among the IALD entries, which are used as features in the automatic classification. Similarly, Chen et al. (2012) built a dictionary containing pejoratives, obscenities, and profanities extracted from Urban Dictionary. By combining lexical features from their dictionary with syntactic features from dependency relations, their models were able to achieve high precision and recall in detecting both offensive content and offensive users. Mubarak et al. (2017) built a list of Arabic obscene words and hashtags by extracting patterns that are frequently used in offensive Twitter posts. This wordlist is used to classify a tweet into three classes: obscene, offensive, and clean. Recent studies have also found that swear words are relevant to several related tasks, including abusive language detection (Nobata et al., 2016), cyberbullying detection (Van Hee et al., 2018; Michal et al., 2010), and hate speech detection (Malmasi & Zampieri, 2018). The most recent study by Holgate et al. (2018) introduced six functions of vulgar word use and built a novel dataset based on them. They filtered their dataset based on the presence of swear words from a list taken from the noswearing website. Their results show that classifying vulgar word use by its function improves system performance in detecting hate speech content.

2.4 Swear words corpora

The development of swear word usage corpora was started by Holgate et al. (2018). They proposed a novel corpus consisting of tweets containing swear words, where every swear word is annotated with one of six labels based on its function. These vulgar word functions are “express aggression”, “express emotion”, “emphasize”, “auxiliary”, “signal group identity”, and “non-vulgar”. The annotation process was carried out via crowdsourcing. Furthermore, they built a model based on logistic regression coupled with several handcrafted features to classify the function of vulgar words automatically. Pamungkas et al. (2020a) also introduced the SWAD (Swear Words Abusiveness Dataset) corpus by filtering tweets from the OLID dataset (Zampieri et al., 2019a) based on swear word presence and annotating them with a binary label, “abusive” or “not-abusive”. They conducted an intrinsic evaluation of SWAD by predicting swear word abusiveness within a tweet as context in two different prediction tasks: sequence labeling and text classification. Recently, Kurrek et al. (2020) also proposed a novel corpus that captures online slur usage. The corpus consists of 39.8k human-annotated comments gathered from Reddit. The annotation guideline outlines four main categories of online slur usage, divided into 12 sub-categories.

In this work, we follow a similar line to Holgate et al. (2018), who try to model the pragmatic use of swear words to improve hate speech detection. However, their work focuses on classifying swear word use by its function and using it as an additional feature to detect hate speech utterances. We focus instead on predicting the abusiveness of swear words, rather than their function, to determine whether the context of a given swear word is abusive (and thus a candidate for content moderation, as it hurts) or not-abusive. Furthermore, we adopt a task setting similar to target-based sentiment analysis to focus only on classifying the swear word’s context at the word level. We also test the additional swear word abusiveness feature on four downstream hate speech detection tasks.

3 Corpus creation and analysis

3.1 Corpus collection

Our starting point was a corpus of tweets selected from the training set of the Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019a), which was proposed in the context of the shared task OffensEval (Zampieri et al., 2019b) at SemEval 2019. The task aims at detecting offensive messages as well as their targets. In OLID, Twitter messages were labelled by applying a multi-layer hierarchical annotation scheme encompassing three dimensions: tags marking the presence of offensive language (offensive vs not offensive), tags categorizing the offensive language (targeted vs untargeted), and tags identifying the offensive target (individual, group, or other). The broader coverage of the concept and definition of offensive language is the main reason we chose this dataset as the starting point for our finer-grained annotation concerning swearing, rather than other datasets developed around more specific typologies of offensive language, such as hate speech, cyberbullying or misogyny, which we think could introduce a bias in our corpus, undermining the generality of its possible future exploitation.

Fig. 1 Corpus development process

Table 1 Corpus statistics after the filtering process

Some preprocessing was applied to the OLID data, such as mention and URL normalization. Since our focus is on analyzing swear words in the tweet context, we first filtered a subset of tweets from OLID based on the presence of swear words, in order to obtain a collection of tweets that include at least one swear word. At this stage we exploited the list of swear words published on the noswearing website, an online dictionary that includes a list of swear words. This dictionary includes 349 swear words covering general vulgarities, slurs, and sex-related terms. We manually checked the list to exclude highly ambiguous words, namely swear words like “ho” and “hard on”. Table 1 shows the full statistics of our corpus after the filtering process. We identified 1,320 tweets that contain at least one swear word. Since the annotation task is at the (swear) word level, tweets with more than one swear word were replicated. We generated as many new instances of the same tweet as the number of swear words occurring in the message, and marked each single swear word with the special tags \(<b>\) and \(</b>\) (e.g. \(<b>\)f*ck\(</b>\), \(<b>\)sh*t\(</b>\), etc.), so that the abusiveness label on each instance records the context of the marked swear word in the tweet (abusive or not). For instance, given the message @USER This sh*t gon keep me in the crib lol f*ck it, two instances will be generated: @USER This \(<b>\)sh*t\(</b>\) gon keep me in the crib lol f*ck it and @USER This sh*t gon keep me in the crib lol \(<b>\)f*ck\(</b>\) it.

We found 154 tweets containing more than one swear word, with between 2 and 6 swear word occurrences each. As a result, we have 1511 instances to be annotated. Figure 1 shows the overall process of our corpus development.
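To make the filter-and-replicate step concrete, the following is a minimal sketch of how it can be implemented. The short word list and helper name are illustrative only (the actual filtering uses the full noswearing list of 349 entries), so this is a sketch of the idea rather than the exact code used to build SWAD.

```python
# Illustrative sketch of the filter-and-replicate step; SWEAR_WORDS stands
# in for the (much larger) noswearing word list.
SWEAR_WORDS = {"f*ck", "sh*t", "b*tch", "d*mn", "*ss"}

def expand(tweet):
    """Return one instance per swear word, with that occurrence marked <b>...</b>."""
    tokens = tweet.split()
    instances = []
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,!?") in SWEAR_WORDS:
            marked = tokens[:i] + ["<b>" + tok + "</b>"] + tokens[i + 1:]
            instances.append(" ".join(marked))
    return instances

for instance in expand("@USER This sh*t gon keep me in the crib lol f*ck it"):
    print(instance)
# @USER This <b>sh*t</b> gon keep me in the crib lol f*ck it
# @USER This sh*t gon keep me in the crib lol <b>f*ck</b> it
```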

3.2 Annotation task and process

The annotation of the 1511 instances involved three expert annotators (the authors), of different genders and ages. All instances were annotated by two independent annotators (A1 and A2). Disagreements were resolved by a third annotator (A3), who labeled the instances where A1 and A2 disagreed. All annotators use English as a second language, at a minimum level of B2. The annotators conducted the process with particular care, adopting a cautious attitude, discussing disagreements thoroughly, and consulting native speakers when in doubt.

3.2.1 Annotation task

Annotators were asked to annotate, with a binary option, whether the highlighted swear word (marked with the \(<b>\) and \(</b>\) tags) can be considered abusive swearing, contributing to the construction of an abusive context (tag “yes”), or whether it does not contribute to the construction of an abusive context (tag “no”). We first ran a trial annotation on a portion of 100 tweets from the collection, to test our annotation guidelines and improve the common understanding between annotators. During this trial annotation we also deepened our understanding of the offensiveness notion, which underlies the definition of offensive language driving the whole OLID annotation process. There is a crucial difference between the coarse notion of offensive language as defined in OLID and the concept of abusive language we are interested in, given our main goal to reason about abusive swearing. Indeed, according to the OLID definition, a tweet can be considered offensive merely because of the presence of profanities, even if no occurrence of abusive swearing can be detected.

Such considerations drove our decision to annotate the abusiveness of swear words on tweets belonging to both classes (offensive and not-offensive) of the OLID data. Another issue discovered during the trial annotation concerned cases where the swear word is used for an indirect insult: the swear word itself is used to insult, but the overall context of the tweet is not abusive. This mostly happened in reported speech, as in Example 3.1 below, which we judged not abusive:

[Example 3.1. Indirect insult.] @USER Everyone saying f*ck Russ dont know a damn thing about him or watched the interview

Therefore, in the final annotation guidelines, we decided to take the author’s intention into account when resolving the swear word context, especially to deal with this kind of swear word use. We consider abusive swearing those uses where swearing contributes to the construction of an abusive context such as name-calling, harassment, hate speech, and bullying, involving sensitive topics including physical appearance, sexuality, race and culture, and intelligence, with the intention of the author of the tweet to insult or abuse a target (person or group of persons). Note that one tweet can have more than one swear word, but for every tweet only one swear word is highlighted as relevant for the annotation in each row (see the replication process explained above). Therefore, the annotator only needs to focus on the marked swear word (e.g., \(<b>\)f*ck\(</b>\)). We remark again that abusive swearing can be found in both offensive and not-offensive tweets; therefore, during the application of our annotation layer, we decided to ignore the original message-level annotation from OLID (offensive vs not-offensive), in order to avoid confusing the annotators during the annotation process. Indeed, when we consider the original OLID labels on the offensiveness of a tweet, we observe four possible cases: (i) the message is offensive and the swear word is abusive, (ii) the message is offensive but the swear word is not abusive, (iii) the message is not offensive but the swear word is abusive, and (iv) the message is not offensive and the swear word is not abusive. We provide an example for each case to better illustrate these circumstances:

[Example 3.2. Offensive tweet & abusive swearing] @USER You are an absolute d*ck

[Example 3.3. Offensive tweet & not abusive swearing] @USER I was definitely drunk as sh*t

[Example 3.4. Not offensive tweet & abusive swearing] @USER b*llshit there’s rich liberals too so what are you saying ???

[Example 3.5. Not offensive tweet & not abusive swearing] @USER Haley thanx! you know how to brighten up my sh*tty day

3.3 Annotation results and disagreement analysis

Considering the two independent annotations of the whole dataset of tweets (A1 and A2), annotators achieved good agreement, selecting the same value for a large portion of the annotated tweets: they disagreed on the presence of abusive swearing in only 216 out of 1511 messages. The average pairwise agreement amounts to 85.70%. The inter-annotator agreement is 0.652 (Cohen’s kappa coefficient), which corresponds to substantial agreement. The final SWAD annotated corpus consists of 1511 swear word instances immersed in the context of 1320 tweets, where 620 swear words are marked as abusive and 891 as not-abusive. Table 2 shows the detailed distribution of our annotation results. Interestingly, we found more not-abusive swearing than abusive swearing in tweets belonging to the offensive class of OLID (728 versus 568). In addition, we also found 52 cases of abusive swearing in tweets belonging to the OLID not-offensive class.
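For reference, agreement figures of this kind can be reproduced with a few lines of code. The sketch below uses scikit-learn and toy labels in place of the actual 1511 annotation pairs; it is an assumption about tooling, since the paper does not state how the coefficient was computed.

```python
# Sketch of the agreement computation; a1 and a2 are toy stand-ins for the
# two annotators' binary judgments ("yes" = abusive swearing).
from sklearn.metrics import cohen_kappa_score

a1 = ["yes", "no", "no", "yes", "no", "no"]
a2 = ["yes", "no", "yes", "yes", "no", "no"]

pairwise_agreement = sum(x == y for x, y in zip(a1, a2)) / len(a1)
kappa = cohen_kappa_score(a1, a2)
print(f"agreement={pairwise_agreement:.2%}, kappa={kappa:.3f}")
```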

Table 2 Label distribution in the SWAD dataset

In the following we list and share some interesting findings and elements of discussion related to the annotation task and outcome.

3.3.1 Most of the non-abusive contexts of swearing are dominated by the emphatic and cathartic swearing functions

Cathartic swearing is the use of a swear word as a response to pain or misfortune (see Example 3.6), while emphatic swearing is the use of a swear word to emphasize another word in order to draw more attention to it (see Example 3.7).

[Example 3.6. Cathartic function] @USER d*mn I felt this shit Why you so loud lol

[Example 3.7. Emphatic function] @USER I AM F*CKING SO F*CKING HAPPY

3.3.2 Emojis could become an important signal to resolve the context of a swear word within the tweet

In some tweets where the context of swear word use is difficult to resolve, the presence of emojis can provide key information. As shown in Example 3.8, without the emoji the swear word f*cking seems to contribute to the construction of an abusive context, but the presence of the Face with Tears of Joy emoji helped the annotators to understand the real context of the whole tweet.

[Example 3.8. Use of emojis] @USER ur a f*cking dumbass fr. there’s no way she is anyone else’s

3.3.3 Irony and sarcasm could pose an issue for automatic prediction based on machine learning approaches

We found some tweets which contain sarcasm and irony, most of the time in a not-abusive context. As in other related tasks such as sentiment analysis, irony and sarcasm can contribute to the difficulty of this task. An example of a tweet where these phenomena are expressed can be seen in Example 3.9.

[Example 3.9. Irony and sarcasm issues] @USER Yeah we need some more made up b*llshit protestors and antifa lol time for an epic beatdown

Furthermore, we analyzed the cases of disagreement between annotators. We manually analyzed the 216 disagreement cases with the aim of extracting the most common patterns that contribute to the difficulty of the annotation task. As a result, we identified several types of difficult cases:

3.3.4 Missing context

We found that some tweets are very short, resulting in missing context (see Example 3.10). Other instances are also challenging to understand due to the presence of grammatical errors (see Example 3.11). These issues are dominant among the annotator disagreement cases.

[Example 3.10. Very short tweet] @USER Lmfaoo!

[Example 3.11. Noisy text with grammatical errors] @USER d*mn that headgear is lit sucks im not on pc ubi plz for console to

3.3.5 Need of world knowledge to understand the context

Some tweets are also very difficult to understand due to the lack of world knowledge, as shown in Example 3.12. Sometimes the annotators needed to gather more information by using a search engine to understand the context. The presence of hashtags is often the key to understanding the nature of the context.

[Example 3.12. Difficult to understand] @USER @USER It’s probably better to have an next to my name than a pink p*ssy hat on my head #MAGA #MakeAmericaGreatAgain

3.4 Corpus extension

After completing the full annotation process, SWAD consisted of 1511 instances. We realized that this collection is still relatively small to obtain reliable performance with machine learning models. Therefore, we extended SWAD by conducting another round of annotation. We included in the collection the test set of the OLID dataset, which contains 860 tweets, and we re-annotated tweets from Holgate’s dataset (Holgate et al., 2018) according to the SWAD guidelines. Similarly to SWAD, tweets in Holgate’s dataset were filtered based on the presence of vulgar words; all instances of vulgar words were then annotated with one of six categories of vulgar word use via crowdsourcing. They introduced six mutually exclusive labels, namely express aggression (AGG), express emotion (EMO), emphasis (EMPH), auxiliary (AUX), signal group identity (SGI), and non-vulgar (NV) use. The idea of including Holgate’s dataset in our collection, applying the SWAD annotation scheme to the data, was stimulated by the possibility of investigating the interaction of our swear word abusiveness label with the swear word function introduced in Holgate’s study.

We annotated the new data by following the same annotation guidelines described in Pamungkas et al. (2020a) and involving the same pool of three expert annotators. For the OLID test set, we obtained 66 instances after the filtering and replication process. For Holgate’s dataset, we only selected the first 1000 tweets to be re-annotated. We re-annotated all tweets regardless of their original labels; to avoid bias in the annotation process, we hid the original labels from Holgate’s study from the annotators’ view. Our effort was therefore directed towards adding another layer of annotation on the swear word. After the annotation process, we obtained 204 instances annotated as abusive and 796 as not abusive for Holgate’s data. Meanwhile, for the OLID test set, we obtained 18 instances annotated as abusive and 48 as not abusive. The inter-annotator agreement on this corpus extension is 0.516 in terms of Cohen’s kappa coefficient, computed from the annotations of the first and second annotators on the 1066 instances. In total, we thus have 2577 instances after this extension process. Table 3 shows the interaction between the original label from Holgate’s work and our new label.

Table 3 Interaction between the original Holgate label and our annotation

Before the annotation process, we expected that most tweets with the AGG label would be classified into the abusive class. However, we found that these AGG tweets were distributed in similar proportions across the abusive and not-abusive classes. We were particularly interested in AGG tweets categorized into the not-abusive class; some examples of these instances are reported below. Based on our annotation guidelines, the first example of swear word use (Example 3.13) is labeled as not abusive because there is no insulted target. Similarly, the second example (Example 3.14) shows an expression of humor and catharsis, which is not classified as abusive according to our annotation guidelines.

[Example 3.13. Not abusive based on our guidelines] My b*llshit radar is on full force today

[Example 3.14. Humor and catharsis] I gained ten pounds this summer. D*mn... L0L

4 Swear words abusiveness prediction

In this section, we provide an intrinsic evaluation of the corpus by conducting cross-validation experiments. We built supervised machine learning models to predict the abusiveness of swear words in SWAD. We model this prediction problem as three different tasks, namely sequence labeling, simple text classification, and target-based swear word abusiveness prediction. The main objective of the sequence labeling experiment is to test the consistency of the corpus annotation. Meanwhile, the classification experiment is devised to shed light on the most predictive features for differentiating between abusive and not-abusive swearing. We also propose to adapt the target-based sentiment analysis task, a well-explored task in the sentiment analysis area, to our setting, as presented in Sect. 4.3.

4.1 Sequence labeling task

In order to test the robustness of the annotation of swear words in SWAD, we devise a cross-validation test based on a sequence labeling task. Given a sequence of words (i.e., a tweet from our dataset), the task consists in correctly labeling each word with one of three possible labels: abusive swear word (SWA), non-abusive swear word (SWNA), or not a swear word (NSW). The task is carried out in a supervised fashion, by splitting the dataset into a training set (90% of the instances) and a test set (the remaining 10%).

4.1.1 Model description and evaluation

For this experiment, we adapt the BERT Transformer-based architecture (Devlin et al., 2019) with the pre-trained English model bert-base-cased. We train the model for 5 epochs, with a learning rate of \(10^{-5}\) and a batch size of 32.
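A minimal sketch of this setup is given below, assuming the HuggingFace Transformers implementation of BERT (the paper does not name the library); the example tweet, the label alignment, and the single training step are illustrative only.

```python
# Token-level classification with BERT over the SWA/SWNA/NSW tag set
# (illustrative sketch; learning rate 1e-5 as stated in the text).
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

LABELS = ["NSW", "SWA", "SWNA"]
label2id = {label: i for i, label in enumerate(LABELS)}

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased",
                                                   num_labels=len(LABELS))

# One word-level example: every word tagged NSW except the swear word.
words = ["@USER", "This", "sh*t", "gon", "keep", "me", "in", "the", "crib"]
word_labels = ["NSW", "NSW", "SWNA", "NSW", "NSW", "NSW", "NSW", "NSW", "NSW"]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Align word-level labels to word pieces; special tokens get -100 (ignored).
aligned = [-100 if wid is None else label2id[word_labels[wid]]
           for wid in enc.word_ids(batch_index=0)]
labels = torch.tensor([aligned])

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = model(**enc, labels=labels).loss   # one training step
loss.backward()
optimizer.step()
```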

4.1.2 Results

Table 4 Sequence labeling task: confusion matrix
Table 5 Sequence labeling task: results broken down by label
Table 6 Ablation test on several feature sets

Table 4 shows the confusion matrix resulting from the cross-validation. Unsurprisingly, the majority of classification errors are due to SWA/SWNA confusion, while the distinction between swear words and non-swear words is basically trivial. The classifier is slightly biased towards predicting abusive swear words (455 SWNA\(\rightarrow \)SWA misclassifications against 217 SWA\(\rightarrow \)SWNA misclassifications). These results are confirmed by the performance measured in terms of per-class precision, recall and \(F_1\)-score, shown in Table 5, where the SWA class has a higher recall than precision, while the opposite is true for the SWNA class. In absolute terms, the per-class and macro \(F_1\)-scores confirm that our annotation is stable when tested in a supervised learning setting. In our test, only one abusive swear word was misclassified as NSW. Interestingly, the word is sk*nk, which is semantically ambiguous, conveying the offensive sense as well as the animal sense. Even more interestingly, the few NSW instances misclassified as SWA are all borderline cases of abusive language: sh*tcago (an offensive slang term for Chicago), messed, c*mming, and c*mslave.

4.2 Simple text classification task

In this setting, we explicitly predict the abusiveness of swear words (as target words) in given tweets as context. We employ several machine learning models, including a linear support vector classifier (LSVC), logistic regression (LR), and a random forest (RF) classifier. We use different features, at the word level (focusing on the target word) and at the tweet level (capturing the context).

4.2.1 Features

Lexical features In this feature set, we focus on word-level features. We include the Swear Word feature, that is, the unigram of the marked swear word, as we aim to investigate whether the abusiveness of a swear word can be predicted from the word choice alone. We also use the Bigrams feature, obtained from the bigrams of the target word with its next and previous words.

Twitter features Since our corpus consists of tweets, we also employ several features which are specific to Twitter data. This feature set includes Hashtag Presence, Emoji Presence, Mention Presence, and Link Presence. We use regular expressions to extract hashtags, mentions and URLs, and a specialized library for emoji extraction.

Sentiment features This feature set is proposed to help resolve the context of the tweet. We use two features: Text Sentiment, to model the polarity of the text, and Emoji Sentiment, to model the overall sentiment of the emojis in the tweet. We use the VADER dictionary (Hutto et al., 2014) to extract the polarity score of the text and the emoji sentiment ranking to obtain the sentiment value for emojis.

Stylistic features In this feature set, we consider several stylistic features commonly used for text classification, such as Capital Word Count, Exclamation Mark Count, Question Mark Count, and Text Length. In addition, we also exploit another word-level feature, namely Swear Word Position, indicating the index position of the marked swear word in the tweet.

Syntactic features In this feature set, we focus on word-level features, including the Part of Speech of the target word and its Dependency Relation with the next and previous words. We extract part-of-speech tags with the NLTK library, while dependency relations are extracted with SpaCy.
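The sketch below illustrates how a subset of these features can be extracted for a single marked swear word. The function name and the exact regular expressions are our own illustrative choices (emoji, emoji-sentiment, part-of-speech, and dependency-relation features are omitted for brevity), and the VADER analyzer is taken from the vaderSentiment package.

```python
# Illustrative extraction of lexical, Twitter, sentiment and stylistic
# features for one marked swear word (sw_index is its token position).
import re
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def extract_features(tweet, tokens, sw_index):
    return {
        # Lexical features
        "swear_word": tokens[sw_index].lower(),
        "prev_bigram": " ".join(tokens[max(0, sw_index - 1):sw_index + 1]),
        "next_bigram": " ".join(tokens[sw_index:sw_index + 2]),
        # Twitter features
        "has_hashtag": bool(re.search(r"#\w+", tweet)),
        "has_mention": bool(re.search(r"@\w+", tweet)),
        "has_url": bool(re.search(r"https?://\S+", tweet)),
        # Sentiment feature (VADER compound polarity of the whole tweet)
        "text_sentiment": analyzer.polarity_scores(tweet)["compound"],
        # Stylistic features
        "n_caps": sum(tok.isupper() for tok in tokens),
        "n_exclamation": tweet.count("!"),
        "n_question": tweet.count("?"),
        "text_length": len(tokens),
        "sw_position": sw_index,
    }

tweet = "@USER I AM F*CKING SO F*CKING HAPPY"
tokens = tweet.split()
print(extract_features(tweet, tokens, 3))
```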

4.2.2 System description and evaluation

We build our models using the Scikit-learn library. We split the dataset into 80% for training and 20% for testing. We use several evaluation metrics, including accuracy, macro average precision, macro average recall, and macro average F-score. An ablation test is performed to investigate the role of each feature set in the classification result. The swear word unigram feature is used as the baseline in this experimental setting.
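As an illustration of this setup, the following sketch trains one of the models (LR) on toy feature dictionaries of the kind produced above; the data, labels, and split are placeholders, not the actual SWAD features.

```python
# Toy end-to-end classification sketch with scikit-learn.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X = [  # placeholder feature dicts (one per marked swear word)
    {"swear_word": "f*ck", "text_sentiment": -0.6, "sw_position": 5},
    {"swear_word": "sh*t", "text_sentiment": 0.4, "sw_position": 2},
    {"swear_word": "b*tch", "text_sentiment": -0.8, "sw_position": 7},
    {"swear_word": "d*mn", "text_sentiment": 0.1, "sw_position": 0},
    {"swear_word": "f*ck", "text_sentiment": 0.7, "sw_position": 3},
    {"swear_word": "*ss", "text_sentiment": -0.5, "sw_position": 6},
]
y = ["abusive", "not-abusive", "abusive", "not-abusive", "not-abusive", "abusive"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test), zero_division=0))
```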

4.2.3 Results

Table 6 shows the full results of the text classification experiment using the LSVC, LR, and RF models. We start the experiment by using all feature groups together; we then remove one feature group at a time to assess the importance of each group for model performance. Overall, RF under-performs compared to the other two classifiers, while LR performs best. Based on the macro average F-score, the best performance is achieved using all features coupled with LR. The same model without the Bigrams feature obtains similar performance, but a lower macro average recall. The ablation experiment shows that the unigram of the swear word is the most informative feature in this classification task. Bigrams, sentiment, emotion, stylistic and syntactic features all contribute to the classification performance, while the Twitter features have a detrimental effect on the LSVC and RF models. The main issue in this task is the lower recall compared to precision, which is consistent across all models and indicates that the models struggle with false negatives. We argue that this is due to the dataset imbalance, since the swear words are predominantly in the not-abusive (negative) class.

4.3 Target-based abusiveness prediction of swear words

This setting is similar to the text classification task presented in Sect. 4.2. However, here we explicitly model the task by adopting a setting similar to the target-dependent sentiment analysis task (Vo & Zhang, 2015; Saeidi et al., 2016), whose main objective is to identify the sentiment polarity of a given target in an utterance. This task is also related to aspect-based sentiment analysis. However, in target-dependent sentiment analysis the target word is known and mentioned explicitly in the given utterance, while in aspect-based sentiment analysis the target aspect can be expressed implicitly, so aspect detection is also part of the task. Adopting the idea of target-dependent sentiment analysis, we use the swear word as the target word, with the main objective of predicting its abusiveness in a given utterance as context.

[Example 4.1. Tweet from Davidson’s dataset] @USER d*mn I hate a b*tch that like to argue and sh*t

In Example 4.1, we can find three swear words in the tweet, i.e., “d*mn”, “b*tch”, and “sh*t”. Therefore, there are three target words, and the task is to predict the abusiveness of each swear word in the tweet individually. Based on our manual investigation, the first swear word is not abusive, the second one is abusive, while the third one is more difficult to assess. The first swear word is used to express catharsis, which is not abusive in most cases. The second swear word is abusive because it insults a target. The last swear word is problematic since it is used as an idiomatic expression. The abusiveness of a given swear word is highly dependent on its context in the tweet, analogously to the target in target-dependent sentiment analysis.

4.3.1 System description and evaluation

In this experiment, we adopt several state-of-the-art models from the target-dependent sentiment analysis task as baseline models. In addition, we also implement a BERT model by applying a simple masking approach to mark the target words. We evaluate model performance by using several standard evaluation metrics, including precision, recall, F-score and accuracy. We report precision, recall, and F-score on both the positive and negative classes. We split our extended SWAD corpus into training (70%), development (10%), and testing (20%) sets for the experiment. A short description of each model used in our experiment follows:

  • TD-LSTM (Target-dependent LSTM) The basic idea of this architecture is to model the preceding and following context surrounding the target word, so that the feature representation consists of a left part (preceding the target word) and a right part (following the target word) (Tang et al., 2016). Specifically, the architecture consists of two LSTMs (LSTM left and LSTM right), which model the contexts preceding and following the target word, respectively. The outputs of these LSTMs are concatenated and fed into a softmax layer to predict the label.

  • TC-LSTM (Target-connection LSTM) This architecture is a further development of TD-LSTM which incorporates a target connection component. The additional component explicitly models the connection between the target word and each context word when building the sentence representation (Tang et al., 2016). This component is implemented as a target word vector obtained by averaging the vectors of the words it contains. This vector is then concatenated to each word representation before feeding it to the LSTM network. The rest of the architecture is the same as in TD-LSTM.

  • AE-LSTM (Aspect Embedding LSTM) This architecture (Wang et al., 2016) tries to learn an embedding vector for each aspect or, in our study, for each target word. This vector is concatenated to the sentence embedding representation, which is then fed to the LSTM network. The additional vector representation of the target word gives vital information to the model for learning the sentiment of each target word.

  • AT-LSTM (Attention-based LSTM) The standard LSTM is not able to capture the parts of a sentence that are important for aspect-based sentiment classification. This model (AT-LSTM) (Wang et al. 2016) has an attention mechanism which captures the important part of a sentence with respect to the given aspect. The attention mechanism takes as input the hidden states produced by the LSTM and the aspect embedding vector, and produces an attention weight vector and a weighted hidden representation.

  • ATAE-LSTM (Attention-based LSTM with Aspect Embedding) Basically, this architecture (Wang et al., 2016) is the AT-LSTM model combined with the aspect embedding vector as implemented in AE-LSTM.

  • CABASC (Content Attention Based Aspect Based Sentiment Classification) This architecture consists of two enhanced attention mechanisms (Liu et al., 2018): a sentence-level content attention mechanism, which captures the important information about the given aspect from a global perspective, and a context attention mechanism, which simultaneously takes the order of the words and their correlations into account, by embedding them into a series of customized memories.

  • IAN (Interactive Attention Network) This architecture uses two LSTM networks to model the sentence and the target words (Ma et al., 2017). The hidden states of the target word and of the context sentence are then used in parallel to interactively generate attention vectors. Finally, these attention vectors produce a sentence representation and a target representation.

  • RAM (Recurrent Attention on Memory) This framework implements a multiple-attention mechanism that captures sentiment features separated by a long distance, making it more robust against irrelevant information (Chen et al., 2017). The outputs of the multiple attentions are non-linearly combined with an LSTM network, strengthening the model for handling more complicated cases.

  • TD-BERT (Target-dependent BERT) We also propose to adopt the idea of TD-LSTM and exploit the state-of-the-art pre-trained model BERT as the language representation. Our model consists of two BERT layers (BERT left and BERT right), which represent the preceding and following context of the target word, respectively. The output of these BERT layers is passed into a fully connected dense layer with RELU activation before going into a final sigmoid layer that produces the prediction. This model is optimized with the Adam optimizer with a learning rate of 1e-5 and trained for three epochs with a batch size of 32.

  • TM-BERT (Target-masked BERT) The BERT model has an attention mechanism that can model many downstream tasks involving a single text or text pairs. BERT encodes multiple text segments using two special tokens ([SEP] and [CLS]). The [SEP] token is used to separate two or more text segments when multiple segments are processed; for a single text, the encoded text starts with the [CLS] token and ends with the [SEP] token. In this model, we add a special token [SW] before and after the swear word in the sentence. The intuition is to signal the important part (the target word) of the text to the model; we expect BERT to construct its representation by focusing on this special token. We use the open-source implementation of BERT by HuggingFace, which provides a dedicated method for adding new special tokens to the BERT tokenization process (a minimal sketch of this marking step is given after this list).
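The listing below sketches the target-marking step of TM-BERT using the HuggingFace Transformers API; the exact token handling and fine-tuning loop in our experiments may differ, so this should be read as an illustration of the idea rather than the exact implementation.

```python
# Sketch of TM-BERT target marking: register [SW] as a special token,
# wrap the marked swear word with it, and classify the sentence.
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForSequenceClassification.from_pretrained("bert-base-cased",
                                                      num_labels=2)

# [SW] is added as an additional special token so it is never split into
# word pieces; the embedding matrix is resized to make room for it.
tokenizer.add_special_tokens({"additional_special_tokens": ["[SW]"]})
model.resize_token_embeddings(len(tokenizer))

text = "@USER This [SW] sh*t [SW] gon keep me in the crib lol f*ck it"
enc = tokenizer(text, return_tensors="pt", truncation=True)
logits = model(**enc).logits   # abusive vs not-abusive scores for the target
```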

4.3.2 Results

Table 7 Result of target-based abusiveness prediction of swear words

As shown in Table 7, TM-BERT obtains the best result, with an F-score of .665 on the positive class, .843 on the negative class, and a macro F-score of .754. Overall, the BERT-based models achieve better results than the other models, with TD-BERT also obtaining competitive results on all evaluation metrics. We also notice that TD-LSTM and TC-LSTM obtain better results than the rest of the non-BERT models, including CABASC and RAM, which achieved better performance on several benchmarks for the aspect-based sentiment analysis task (Liu et al., 2018). Comparing these results with those of the previous experiment (Table 6), the neural models presented here outperform the traditional models. We also notice that most of the models exploited in this experiment are able to cope with the dataset imbalance issue observed in the previous experiments with traditional models, where recall was lower than precision.

5 Swear words in abusive language detection

5.1 Task description and experimental settings

In order to answer the third research question (RQ3), we explore the usefulness of the swear word abusiveness feature on several downstream abusive language detection tasks. We reiterate our assumption that knowing whether the swear word context is abusive or not can help the system resolve the abusiveness of the whole utterance. Therefore, our idea is to explicitly infuse the swear word abusiveness prediction into the abusive language detection model to help the model deal with swear word ambiguity. The overall experimental scenario is illustrated in Fig. 2.

First, we need to select abusive language benchmarks that contain a high frequency of swear words. We selected four dataset collections: HatEval (Basile et al., 2019), AMI@IberEval (Fersini et al., 2018), AMI@Evalita (Fersini et al., 2018), and Davidson (Davidson et al., 2017). These datasets contain a fairly high frequency of swear words: around half of their instances contain swear words, specifically 42.26%, 56.74%, 62.79%, and 69.2% for the HatEval, AMI@Evalita, AMI@IberEval, and Davidson datasets respectively. A short description of each dataset follows.

Fig. 2 Process to infuse additional features

5.1.1 HatEval dataset

The dataset focuses on the detection of hate speech on Twitter against two specific targets, immigrants and women, in a multilingual perspective (Basile et al., 2019). The HatEval shared task introduced a dataset in two languages, English and Spanish; we focus only on the English collection. The HatEval collection was gathered using several keywords, including neutral keywords, pejorative words towards the targets, and highly polarized hashtags. The dataset was annotated by judges on a crowdsourcing platform, applying an annotation scheme with three binary labels: hate speech (hate speech or not), target range (generic or individual), and aggressiveness (aggressive or not). The final dataset used for the English HatEval shared task contains 13,000 tweets (about 10,000 for training and 3000 for testing).

5.1.2 AMI datasets

The datasets for AMI@IberEval (Fersini et al., 2018) and AMI@Evalita (Fersini et al., 2018) were selected from the same collection of tweets, which was gathered using three approaches: querying the Twitter streaming API with keywords, monitoring the accounts of online harassment victims, and downloading tweets from misogynist accounts. The data were annotated with three annotation layers: misogyny identification (misogynous or not), misogynistic behaviour (stereotype, dominance, derailing, sexual harassment, and discredit), and target of misogyny (active or passive). Here we focus only on the misogyny identification task, where models need to predict whether a given tweet is misogynous or not. The AMI@IberEval dataset contains 3977 tweets (3251 training and 831 testing), while the AMI@Evalita collection contains 5000 tweets (4000 training and 1000 testing). Originally, the AMI datasets are available in three languages, English, Italian, and Spanish, but here we only focus on English.

5.1.3 Davidson dataset

The dataset was built by Davidson et al. (2017) and contains 24,783 tweets manually annotated via crowdsourcing. Differently from the other datasets considered, this dataset is annotated with three labels: hate speech, offensive, and neither. The tweets were sampled from a collection of 85.4 million tweets gathered via the Twitter search API, focusing on tweets containing keywords from HateBase. Only 5.8% of the tweets were labeled as hate speech and 77.4% as offensive, while the remaining 16.8% were labelled as neither.

The second step is to predict the abusiveness of swear words in each instance of these datasets. We pre-processed all instances in the same way as SWAD (see Fig. 1), marking the swear word and replicating the instance when more than one swear word is found. After this preprocessing step, we predict the abusiveness of the marked swear word in every instance with our best performing system from the previous section, TM-BERT. For instances containing more than one swear word, we aggregate the prediction scores by taking the minimum (MIN), maximum (MAX), or average (AVG) score. For instances that do not contain any swear word, we set the prediction score to 0.
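The aggregation step can be summarized by the small helper below; the function name and the example scores are illustrative.

```python
# Aggregate per-swear-word abusiveness probabilities into one tweet-level
# feature; tweets without swear words get a score of 0.
def aggregate_scores(scores, mode="MIN"):
    if not scores:                     # no swear word in the tweet
        return 0.0
    if mode == "MIN":
        return min(scores)
    if mode == "MAX":
        return max(scores)
    return sum(scores) / len(scores)   # AVG

# e.g. TM-BERT predicts 0.91 and 0.12 for the two swear words in a tweet
print(aggregate_scores([0.91, 0.12], mode="MIN"))   # 0.12
```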

The final step is to infuse the swear word abusiveness prediction score into the base model for detecting abusive language in the respective tasks. Note that this work does not aim to produce the best possible system for these shared tasks, but rather to test our hypothesis on the usefulness of predicting the pragmatics of swear word use. For this experiment, we employ a straightforward BERT model with minimal hyperparameter tuning. We use the bert-base-cased model available on TensorFlow Hub, which allows us to integrate BERT with the Keras functional API. Our network starts with the BERT layer, which takes three inputs (ids, mask, and segments), followed by a dense layer with RELU activation (256 units) and an output layer with sigmoid activation. We train the network with the Adam optimizer with a learning rate of \(2 \times 10^{-5}\). We tune this model by trying several combinations of batch size (32, 64, 128) and number of epochs (1–5). We infuse the additional feature by simply concatenating the swear word abusiveness probability into the dense layer after the BERT embedding layer.
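A hedged sketch of the feature-infusion architecture is shown below; the TensorFlow Hub handle, the input pipeline, and the exact layer wiring are assumptions on our part (the text only specifies a BERT layer, a 256-unit RELU dense layer, a sigmoid output, and concatenation of the abusiveness probability).

```python
# Keras functional sketch: BERT pooled output concatenated with the
# aggregated swear word abusiveness score (hub handle is an assumption).
import tensorflow as tf
import tensorflow_hub as hub

MAX_LEN = 128
bert_layer = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/4",
    trainable=True)

ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_word_ids")
mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_mask")
segments = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_type_ids")
abusiveness = tf.keras.Input(shape=(1,), dtype=tf.float32, name="abusiveness")

bert_out = bert_layer({"input_word_ids": ids, "input_mask": mask,
                       "input_type_ids": segments})
# Concatenate the swear word abusiveness feature with the pooled BERT output
x = tf.keras.layers.Concatenate()([bert_out["pooled_output"], abusiveness])
x = tf.keras.layers.Dense(256, activation="relu")(x)
output = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs=[ids, mask, segments, abusiveness], outputs=output)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```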

5.2 Results

We apply standard evaluation metrics in this experiment, including precision, recall, F-score, and accuracy. We report precision, recall, and F-score on both the positive and negative classes to better picture the system performance. Tables 8, 9, 10, and 11 present the results of the experiments on the HatEval task, the AMI@Evalita task, the AMI@IberEval task, and the Davidson dataset, respectively. As mentioned before, we experiment with three additional features, namely MIN, MAX, and AVG, which correspond to the approach used to aggregate the abusiveness scores when more than one swear word exists in the tweet. We mark with a superscript (*) the results where the performance improvement over the baseline models is statistically significant (\(F_{avg}\) and Acc columns).

On the HatEval task, the additional feature was able to improve the model performance. The best result is obtained using the MIN score, with .482 macro average F-score, a statistically significant improvement over the baseline model. A similar result is observed on both the AMI@Evalita and AMI@IberEval datasets: all models infused with the additional features show a significant performance improvement, with the best result obtained using the MIN aggregation score. The improvement is consistent in both classes, as observed from the F-scores on the positive (\(F_1\)) and negative (\(F_0\)) classes. However, a different result is observed in the experiment on the Davidson dataset, as presented in Table 11, where the additional features were not able to improve the model performance.

Interestingly, MIN aggregation turns out to be the most effective approach on most datasets. Based on further investigation, we found two possible reasons for this result. First, we found several examples where two or more swear words are used with different degrees of abusiveness within one tweet, as in the example below taken from the AMI Evalita collection. Our model predicted a high abusiveness degree for the first swear word and a low degree for the second one. With MIN aggregation, the additional feature informs the model that there is a not-abusive swear word, which can become an important signal for resolving the context of the whole message; with MAX aggregation, the additional feature could instead deceive the model. In this case, MIN aggregation provides better knowledge for the model. Second, many instances of HatEval, AMI@Evalita, and AMI@IberEval contain more than one swear word, so the way the scores are aggregated heavily influences the prediction result, and MIN aggregation provides better information for the models.

[Example 5.1. Not Misogyny tweet from AMI Evalita dataset] everytime i reach the highlights of smut im reading me. ok ho* calm down calm down sit your *ss relax its just a smut

Regarding the peculiar result on the Davidson dataset, we conducted a deeper investigation. We notice that our models struggle to detect the hate speech class, as observed in Table 11, where the F-score on the hate speech class is very low. Furthermore, our additional feature also failed to improve the model performance in identifying hate speech instances. A manual inspection of the dataset shows that our swear word abusiveness prediction model struggles to differentiate between swear word uses in the offensive class and in the hate speech class. For example, as shown below (Example 5.2 and Example 5.3), our model predicts the swear word uses in both classes with a high abusiveness degree. Even with human judgment, we could not differentiate the abusiveness degree of the swear words in the two messages. We argue that this issue is the main reason for the limited impact of our additional features on the Davidson dataset.

[Example 5.2. Offensive tweet from Davidson’s dataset] @USER @USER so you was in a female DMs talking to another n*gga... You’re a f*ggot...

[Example 5.3. Hate Speech tweet from Davidson’s dataset] Vanessa is such a f*ckin f*ggot.

Table 8 Results of investigating the role of swear words in the HatEval task
Table 9 Results of investigating the role of swear words in the AMI Evalita task
Table 10 Results of investigating the role of swear words in the AMI IberEval task
Table 11 Results of investigating the role of swear words in the Davidson dataset

6 Conclusion and future work

The research presented in this paper investigates the automatic classification of abusive swearing. We developed a new benchmark corpus called SWAD, consisting of English tweets in which abusive swearing is manually annotated at the word level. Our initial corpus consists of 1511 instances of swearing from 1320 tweets, where 620 swear words were annotated as abusive and 891 as not-abusive. The inter-annotator agreement is 0.708 based on Cohen's Kappa coefficient, which denotes substantial agreement. Furthermore, we extended the corpus to improve its coverage when used by statistical models. We added 66 tweets from the OLID test set, which were missing from the first round of annotation, and 1000 instances from Holgate's dataset. The second annotation round labeled 204 instances as abusive and 796 as not-abusive, with an inter-annotator agreement of 0.516, which denotes moderate agreement. Our final collection consists of 2577 instances from 2282 tweets.

We built models trained on the SWAD corpus to automatically classify abusive and not-abusive swear words, providing an intrinsic evaluation of SWAD. We modeled the task in three different settings, namely sequence labeling, simple text classification, and target-based swear word abusiveness prediction. We used BERT for sequence labeling, simpler but more transparent models for text classification, and a wide range of models, including several state-of-the-art models from aspect-based sentiment analysis, for the target-based task. Our results confirm that our annotation is robust, as shown by the sequence labeling performance. The text classification results, in turn, provided new insights on the most predictive features for distinguishing abusive and not-abusive swear words; in particular, we found that a wide range of features can improve the models' performance. Meanwhile, modeling the task similarly to aspect-based sentiment analysis leads to promising results: our BERT-based models obtain the best results in this setting, significantly better than the simple text classification setting in which we implemented more traditional models.

Finally, we explored the usefulness of predicting swear words' abusiveness in several downstream abusive language detection tasks. Based on the models built for swear word abusiveness prediction (RQ2), we introduced a novel feature, the swear word abusiveness feature, and infused it into current abusive language detection models. We tested our approach on several abusive language detection tasks, namely HatEval, AMI@Evalita, AMI@IberEval, and the Davidson dataset, showing consistent and significant performance improvements across tasks, with the exception of the Davidson dataset. Our further investigation revealed that the different notion of abusiveness underlying the annotation of the Davidson dataset was the main reason why our feature was not impactful there.

While these results are encouraging, we believe there is still room for improvement for both the corpus and the automatic classification of swearing. We aim to improve the dataset by introducing a fine-grained categorization of swear words, such as the ones proposed by Pinker (2007) and McEnery (2006). We also plan to apply our swear word abusiveness feature to more tasks and datasets (Poletto et al., 2021) to obtain a fuller picture of its impact on abusive language detection. Another possibility is to use the swear word abusiveness feature as a domain-independent feature, a kind of feature that has proven important for transferring knowledge in cross-domain abusive language detection (Pamungkas et al., 2020b; Pamungkas & Patti, 2019; Chiril et al., 2021). We are also aware that the results of the different aggregation approaches depend heavily on the dataset, and this deserves further investigation.

Applying our methodology to other languages is not trivial, as it depends on the availability of language resources and robust NLP tools (Pamungkas et al., 2021). Fortunately, full-fledged NLP pipelines exist for many languages, thanks for instance to large-scale initiatives such as Universal Dependencies, which provides among its deliverables the UDPipe software library and a broad set of trained models for more than 70 languages (Nivre et al., 2016; Straka et al., 2016). Deep learning models, including transformer-based networks, are also emerging for languages less resourced than English; see for instance the Italian BERT model AlBERTo (Polignano et al., 2019). Moreover, the multilingual lexicon of offensive words HurtLex (Bassignana et al., 2018) could provide a solid basis to compile lists of swear words in the 53 languages it covers.

Finally, let us mention the issue of implicit constructions denoting abusive content and of multi-word swear constructions possibly present in tweets. The detection of implicitly abusive language, i.e., abusive language that is not conveyed by abusive words, has recently been recognised as one of the most prominent challenges in the field (Wiegand et al., 2021; Caselli et al., 2020). Implicit abuse stands in rather direct contrast with lexicon-based methods, which rely on lists of swear words. Therefore, an interesting direction for future work is to extend our method to determine the abusiveness of non-swearing words, addressing the detection of toxicity expressed without swear words, where the abusive load is masked by euphemistic constructions and figurative devices such as rhetorical questions, comparisons, metaphors, irony, and sarcasm.