Impact of Data Augmentation on the Hate Speech Detection in Portuguese Language

Félix Silva, Artur Cerri, Ulisses Brisolara Corrêa, Larissa A De Freitas

doi:10.32473/flairs.37.1.135307

Abstract

Online communities allow users to establish a web presence, manage their identities, and stay connected with others. The internet has facilitated global outreach with just a click on the World Wide Web. However, the current landscape of online social media platforms is marred by various issues, with hate speech taking center stage. Hate speech is characterized by hostile and malicious language driven by prejudice, targeting individuals or groups based on their innate, natural, or perceived characteristics. Detecting such speech is crucial for maintaining a safe online environment. This study examines the impact of dataset regularization techniques on the performance of BERTimbau-based models when applied to four Portuguese hate speech datasets: Fortuna et al. (2019), OFFCOMBR-2, ToLD-BR, and Hate-BR. Four Data Augmentation techniques are evaluated: Oversampling, Undersampling, Text Augmentation, and Synonym Replacement. Our experiments revealed that, apart from the Fortuna et al. (2019) dataset, the Data Augmentation techniques did not significantly enhance performance on hate speech detection tasks.
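To make two of the techniques named in the abstract concrete, the following is a minimal sketch of random oversampling and random undersampling on a labeled text dataset. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the resampling was performed, and the function names (`random_oversample`, `random_undersample`, `label_counts`), the `(text, label)` tuple layout, and the toy messages are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def group_by_label(examples):
    """Group (text, label) pairs by their label."""
    groups = defaultdict(list)
    for text, label in examples:
        groups[label].append((text, label))
    return groups


def label_counts(examples):
    """Count how many examples carry each label."""
    return Counter(label for _, label in examples)


def random_oversample(examples, seed=0):
    """Duplicate randomly chosen minority examples until every class
    is as large as the largest class (assumed strategy)."""
    rng = random.Random(seed)
    groups = group_by_label(examples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap to the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


def random_undersample(examples, seed=0):
    """Randomly drop majority examples until every class
    is as small as the smallest class (assumed strategy)."""
    rng = random.Random(seed)
    groups = group_by_label(examples)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        # Sample without replacement, so no example is repeated.
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: label 1 = hate speech, label 0 = not hate speech.
toy = [
    ("msg a", 0), ("msg b", 0), ("msg c", 0), ("msg d", 0),
    ("msg e", 1), ("msg f", 1),
]
oversampled = random_oversample(toy)
undersampled = random_undersample(toy)
```

In this sketch, oversampling grows the toy set to four examples per class, while undersampling shrinks it to two per class. Resampling of this kind would normally be applied to the training split only, so that evaluation data keeps its original class distribution.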
