Abstract

Abusive language detection is becoming increasingly important, but we still know little about the biases present in abusive language detection datasets and how these biases affect detection quality. In the work reported here, we reproduce the investigation of Wiegand et al. (2019) into the differences between sampling strategies. They compared boosted random sampling, where abusive posts are upsampled, and biased topic sampling, which focuses on topics that are known to cause abusive language. Instead of comparing individual datasets created using these sampling strategies, we apply the sampling strategies to a single, large dataset, thus eliminating the textual source of the dataset as a potential confounding factor. We show that differences in the textual source can have more effect than the chosen sampling strategy.
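To make the two sampling strategies concrete, the following is a minimal sketch (not the authors' code): it assumes a corpus of posts with binary abusive labels, an illustrative boost_factor for upsampling abusive posts, and a hypothetical keyword list standing in for the abuse-prone topics used in biased topic sampling.

```python
import random

def boosted_random_sample(posts, n, boost_factor=3):
    """Random sampling in which abusive posts are upsampled (sampled with replacement)."""
    weights = [boost_factor if p["abusive"] else 1 for p in posts]
    return random.choices(posts, weights=weights, k=n)

def biased_topic_sample(posts, n, topic_keywords):
    """Sampling restricted to posts mentioning topics known to attract abuse."""
    on_topic = [p for p in posts
                if any(kw in p["text"].lower() for kw in topic_keywords)]
    return random.sample(on_topic, min(n, len(on_topic)))

# Example usage on a toy corpus; field names and keywords are illustrative only.
corpus = [{"text": "a post about refugees", "abusive": False},
          {"text": "an abusive post", "abusive": True}]
boosted = boosted_random_sample(corpus, n=2)
topical = biased_topic_sample(corpus, n=1, topic_keywords=["refugees"])
```

Because both functions draw from the same corpus, the textual source is held constant and only the sampling strategy varies, which is the design the abstract describes.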

Highlights

  • Abusive language detection has become an important problem, especially in a world where #BlackLivesMatter, and where abusive posts on social media need to be found and deleted automatically

  • We have investigated the interaction between different sampling strategies and classification results for abusive language detection datasets

  • We have reproduced the two sampling strategies distinguished by Wiegand et al. (2019), boosted random sampling and biased topic sampling, but applied them to the same dataset in order to eliminate differences resulting from the textual sources



Introduction

Abusive language detection has become an important problem, especially in a world where #BlackLivesMatter, and where abusive posts on social media need to be found and deleted automatically. Wiegand et al. (2019) present one of the first investigations into bias in different datasets for abusive language detection for English. They compare characteristics of six datasets based on their underlying sampling strategy, their proportion of abusive posts, and the proportion of explicit abuse. We take a closer look at this distinction, along with the out-of-vocabulary (OOV) rate of instances. The remainder of the paper is structured as follows: Section 2 explains our research questions, Section 3 provides an overview of related work on bias in abusive language detection data, and Section 4 discusses our experimental setup, including datasets, lexicons, sampling strategies, the classifier, and evaluation.
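As a small illustration of the OOV rate mentioned above, the sketch below assumes whitespace tokenization and a fixed training vocabulary; the function name and interface are hypothetical and not taken from the paper.

```python
def oov_rate(instance_tokens, vocabulary):
    """Fraction of tokens in an instance that do not occur in the training vocabulary."""
    if not instance_tokens:
        return 0.0
    unknown = sum(1 for tok in instance_tokens if tok not in vocabulary)
    return unknown / len(instance_tokens)

# Toy example: two of the five tokens are unseen, giving an OOV rate of 0.4.
train_vocab = {"this", "is", "a", "post"}
print(oov_rate("this is an unseen post".split(), train_vocab))  # 0.4
```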

Research Questions
How dependent are results on the topic used for sampling?
Related Work
Datasets
Generating Sampling Variants
Data Preprocessing and Features
Results
Classifier
Repeated Subset Sampling
Comparing Random Boosted Sampling and Biased Topic Sampling
Comparing Wide and Narrow Topic Definitions for Biased Topic Sampling
Explicit and Implicit Abuse
Out-of-Vocabulary Rates
Conclusion and Future Work