ABSTRACT
Cross-disciplinary web science research depends on useful, high-quality datasets that provide value to studies across disciplines. In this paper, we introduce a large, hand-coded corpus of online harassment data. A team of researchers collaboratively developed a codebook using grounded theory and labeled 35,000 tweets. The resulting dataset contains roughly 15% positive (harassing) examples and 85% negative examples. The data are useful for training machine learning models, identifying textual and linguistic features of online harassment, and studying the nature of harassing comments and the culture of trolling.
Index Terms
- A Large Labeled Corpus for Online Harassment Research