A Large Labeled Corpus for Online Harassment Research

Authors:
Jennifer Golbeck
Zahra Ashktorab
Rashad O. Banjo
Alexandra Berlinger
Siddharth Bhagwan
Cody Buntain
Paul Cheakalos
Alicia A. Geller
Quint Gergory
Rajesh Kumar Gnanasekaran
Raja Rajan Gunasekaran
Kelly M. Hoffman
Jenny Hottle
Vichita Jienjitlert
Shivika Khare
Ryan Lau
Marianna J. Martindale
Shalmali Naik
Heather L. Nixon
Piyush Ramachandran
Kristine M. Rogers
Lisa Rogers
Meghna Sardana Sarin
Gaurav Shahane
Jayanee Thanki
Priyanka Vengataraman
Zijian Wan
Derek Michael Wu

Affiliation (all authors): University of Maryland, College Park, MD, USA

WebSci '17: Proceedings of the 2017 ACM on Web Science Conference, June 2017, Pages 229–233. https://doi.org/10.1145/3091478.3091509

Published: 25 June 2017

ABSTRACT

A fundamental part of conducting cross-disciplinary web science research is having useful, high-quality datasets that provide value to studies across disciplines. In this paper, we introduce a large, hand-coded corpus of online harassment data. A team of researchers collaboratively developed a codebook using grounded theory and labeled 35,000 tweets. Our resulting dataset has roughly 15% positive harassment examples and 85% negative examples. This data is useful for training machine learning models, identifying textual and linguistic features of online harassment, and studying the nature of harassing comments and the culture of trolling.

Index Terms

Networks > Network types > Overlay and other logical network structures > Social media networks

Published in
WebSci '17: Proceedings of the 2017 ACM on Web Science Conference
June 2017
438 pages
ISBN: 9781450348966
DOI: 10.1145/3091478
Conference Chairs:
Peter Fox, Rensselaer Polytechnic Institute, USA
Deborah McGuinness, Rensselaer Polytechnic Institute, USA
Lindsay Poirer, Rensselaer Polytechnic Institute, USA
Program Chairs:
Paolo Boldi, Universita degli Studi di Milano, Italy
Katharina Kinder-Kurlanda, GESIS - Leibniz Institute for the Social Sciences, Germany
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
datasets
online harassment
Qualifiers
- short-paper
Acceptance Rates
WebSci '17 Paper Acceptance Rate: 30 of 85 submissions, 35%
Overall Acceptance Rate: 218 of 875 submissions, 25%
Article Metrics
- Total Citations: 82
- Total Downloads: 3,420
- Downloads (Last 12 months): 1,378
- Downloads (Last 6 weeks): 57