Comparing pre-trained language models for Spanish hate speech detection

Flor Miriam Plaza-Del-Arco, M Dolores Molina-González, L Alfonso Ureña-López, M Teresa Martín-Valdivia

doi:10.1016/j.eswa.2020.114120

Abstract

The large volume of unmoderated content posted daily on the Web has been accompanied by a sharp worldwide increase in the dissemination of hate speech. Social media, blogs and community forums are examples of platforms where people can communicate freely. However, this freedom of expression is not always exercised respectfully, since offensive or insulting language is sometimes used. Social media companies often rely on users and content moderators to report this type of content. Nevertheless, given the amount of content generated every day, automatic systems based on Natural Language Processing techniques are required to identify abusive language online. To date, most systems developed to combat this problem focus on English content, but the issue is a worldwide concern and therefore affects other languages such as Spanish. In this paper, we address the task of Spanish hate speech identification on social media and provide a deeper understanding of the capabilities of recent machine learning techniques. In particular, we compare the performance of Deep Learning methods with that of recent pre-trained language models based on Transfer Learning, as well as with traditional machine learning models. Our main contribution is the achievement of promising results in Spanish by applying multilingual and monolingual pre-trained language models such as BERT, XLM and BETO.
