2018 | Original Paper | Book Chapter

Fighting Adversarial Attacks on Online Abusive Language Moderation

Authors: Nestor Rodriguez, Sergio Rojas-Galeano

Published in: Applied Computer Sciences in Engineering

Publisher: Springer International Publishing

Abstract

Lack of moderation in online conversations may result in personal aggression, harassment or cyberbullying. This kind of hostility is usually expressed through profanity or abusive language. On the basis of this assumption, Google has recently developed a machine-learning model to detect hostility within a comment. The model assesses the extent to which abusive language is poisoning a conversation and assigns the comment a “toxicity” score. Unfortunately, it has been suggested that such a toxicity model can be deceived by adversarial attacks that manipulate the text sequence of the abusive language. In this paper we aim to fight this anomaly. First, we characterise two types of adversarial attacks, one using obfuscation and the other using polarity transformations. Then, we propose a two-stage approach to disarm such attacks by coupling a text deobfuscation method with the toxicity scoring model. The approach was validated on a dataset of approximately 24,000 distorted comments, showing that it is feasible to restore the toxicity score of the adversarial variants. We anticipate that combining machine learning and text pattern recognition methods operating on different layers of linguistic features will help to foster aggression-safe online conversations despite the adversarial challenges inherent to the versatile nature of written language.
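
The two-stage idea in the abstract (deobfuscate first, then score) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and is not the authors' method: the character map, the regular expressions, the toy lexicon and the function names are hypothetical, and a trivial lexicon-based scorer stands in for the real toxicity model. The sketch addresses only obfuscation attacks, not polarity transformations.

```python
import re

# Hypothetical character substitutions an attacker might use. This mapping is
# an assumption for illustration; it is crude and would also mangle genuine
# numbers appearing in a comment.
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)

# Toy lexicon standing in for a real toxicity model (an assumption).
TOY_LEXICON = {"idiot", "stupid"}


def deobfuscate(text: str) -> str:
    """Stage 1: undo a few simple obfuscations before scoring."""
    text = text.lower().translate(LEET_MAP)
    # Remove punctuation inserted between letters, e.g. "i.d.i.o.t" -> "idiot".
    text = re.sub(r"(?<=[a-z])[.*_](?=[a-z])", "", text)
    # Collapse runs of three or more identical letters, e.g. "stuuupid" -> "stupid".
    text = re.sub(r"([a-z])\1{2,}", r"\1", text)
    return text


def toy_toxicity_score(text: str) -> float:
    """Stage 2 stand-in: fraction of tokens found in the toy lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in TOY_LEXICON for token in tokens) / len(tokens)


def score_with_deobfuscation(text: str) -> float:
    """Couple the two stages: clean the text, then score the cleaned text."""
    return toy_toxicity_score(deobfuscate(text))


if __name__ == "__main__":
    comment = "you are an 1d10t"
    print(toy_toxicity_score(comment))        # score of the distorted text
    print(score_with_deobfuscation(comment))  # score after deobfuscation
```

In this toy setting the distorted comment evades the stand-in scorer, while the deobfuscated version is scored as if it had been written plainly. In practice the second stage would be the actual toxicity scoring model rather than a lexicon lookup.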

Metadata
Title
Fighting Adversarial Attacks on Online Abusive Language Moderation
Authors
Nestor Rodriguez
Sergio Rojas-Galeano
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-030-00350-0_40