Abstract

Abusive language detection is becoming increasingly important, but we still know little about the biases present in abusive language detection datasets and how these biases affect detection quality. In the work reported here, we reproduce the investigation of Wiegand et al. (2019) into the differences between sampling strategies. They compared boosted random sampling, where abusive posts are upsampled, and biased topic sampling, which focuses on topics that are known to cause abusive language. Instead of comparing individual datasets created using these sampling strategies, we apply the sampling strategies to a single, large dataset, thus eliminating the textual source of the dataset as a potential confounding factor. We show that differences in the textual source can have more effect than the chosen sampling strategy.
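To make the two sampling strategies concrete, the following is a minimal sketch (not the authors' code): it assumes a corpus of posts with binary abusive labels, an illustrative boost_factor for upsampling abusive posts, and a hypothetical keyword list standing in for the abuse-prone topics used in biased topic sampling.

```python
import random

def boosted_random_sample(posts, n, boost_factor=3):
    """Random sampling in which abusive posts are upsampled (sampled with replacement)."""
    weights = [boost_factor if p["abusive"] else 1 for p in posts]
    return random.choices(posts, weights=weights, k=n)

def biased_topic_sample(posts, n, topic_keywords):
    """Sampling restricted to posts mentioning topics known to attract abuse."""
    on_topic = [p for p in posts
                if any(kw in p["text"].lower() for kw in topic_keywords)]
    return random.sample(on_topic, min(n, len(on_topic)))

# Example usage on a toy corpus; field names and keywords are illustrative only.
corpus = [{"text": "a post about refugees", "abusive": False},
          {"text": "an abusive post", "abusive": True}]
boosted = boosted_random_sample(corpus, n=2)
topical = biased_topic_sample(corpus, n=1, topic_keywords=["refugees"])
```

Because both functions draw from the same corpus, the textual source is held constant and only the sampling strategy varies, which is the design the abstract describes.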

Highlights

  • Abusive language detection has become an important problem, especially in a world where #BlackLivesMatter, and where abusive posts on social media need to be found and deleted automatically

  • We have investigated the interaction between different sampling strategies and classification results for abusive language detection datasets

  • We have reproduced the two sampling strategies distinguished by Wiegand et al. (2019), boosted random sampling and biased topic sampling, but applied them to the same dataset in order to eliminate differences resulting from the textual sources



Introduction

Abusive language detection has become an important problem, especially in a world where #BlackLivesMatter, and where abusive posts on social media need to be found and deleted automatically. Wiegand et al. (2019) present one of the first investigations into bias in different datasets for abusive language detection for English. They compare characteristics of six datasets based on their underlying sampling strategy, their proportion of abusive posts, and the proportion of explicit abuse. We take a closer look at this distinction, along with the out-of-vocabulary (OOV) rate of instances. The remainder of the paper is structured as follows: Section 2 explains our research questions, Section 3 provides an overview of related work on bias in abusive language detection data, and Section 4 discusses our experimental setup, including datasets, lexicons, sampling strategies, the classifier, and evaluation.
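As a small illustration of the OOV rate mentioned above, the sketch below assumes whitespace tokenization and a fixed training vocabulary; the function name and interface are hypothetical and not taken from the paper.

```python
def oov_rate(instance_tokens, vocabulary):
    """Fraction of tokens in an instance that do not occur in the training vocabulary."""
    if not instance_tokens:
        return 0.0
    unknown = sum(1 for tok in instance_tokens if tok not in vocabulary)
    return unknown / len(instance_tokens)

# Toy example: two of the five tokens are unseen, giving an OOV rate of 0.4.
train_vocab = {"this", "is", "a", "post"}
print(oov_rate("this is an unseen post".split(), train_vocab))  # 0.4
```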

Research Questions
How dependent are results on the topic used for sampling?
Related Work
Datasets
Generating Sampling Variants
Data Preprocessing and Features
Results
Classifier
Repeated Subset Sampling
Comparing Random Boosted Sampling and Biased Topic Sampling
Comparing Wide and Narrow Topic Definitions for Biased Topic Sampling
Explicit and Implicit Abuse
Out-of-Vocabulary Rates
Conclusion and Future Work