Towards comprehensive cyberbullying detection: A dataset incorporating aggressive texts, repetition, peerness, and intent to harm

Naveed Ejaz, Fakhra Razi, Salimur Choudhury

doi:10.1016/j.chb.2023.108123

Abstract

The increasing use of social media networks has raised concerns about the growing frequency of cyberbullying incidents. The definition of cyberbullying lacks universal consensus, yet according to several authors, cyberbullying is characterized by aggressive, repetitive, and intentional communication among peers. However, existing cyberbullying detection datasets often focus solely on classifying texts as aggressive or non-aggressive, neglecting the other aspects of cyberbullying and thus hindering research progress. To address this gap, this paper proposes a framework for designing a new dataset that incorporates all four aspects of cyberbullying. The text messages are sourced from a real dataset, while the user data is generated synthetically. The resulting dataset contains messages exchanged randomly among different pairs of users, thus introducing repetition. Additionally, a degree of peerness is defined and calculated to measure the likelihood of two users being peers. The intent to harm is quantified as a numeric value using the ratios of aggression and repetition. As a result, the proposed dataset encompasses all four aspects of cyberbullying by providing repeated aggressive messages among users along with quantitative values of the degree of peerness and intent to harm. The proposed dataset is adaptable: the threshold values for peerness, repetition, and intent to harm are adjustable, offering flexibility for various applications. The paper concludes by presenting the results of some baseline machine-learning methods on the proposed dataset.

Full Text