UHated: hate speech detection in Urdu language using transfer learning

Muhammad Umair Arshad, Waseem Shahzad, Raza Ali, Mirza Omer Beg

doi:10.1007/s10579-023-09642-7

Abstract

Social media has become a driving force for social change in global society. Because of the vast amount of data generated on these platforms, events in one part of the world can quickly reverberate across the globe. However, platform developers face numerous challenges in keeping cyberspace as inclusive and healthy as possible. In recent years, offensive and hate speech on social media has increased, and manual moderation has proved inadequate given the scale of the problem. There is therefore a need for an automated technique that can detect and remove offensive and hateful comments before they cause harm. In this research, we apply transfer learning using pre-trained FastText Urdu word embeddings and multilingual BERT embeddings (RoBERTa). We also develop an Urdu-language hate lexicon and use it to create an annotated dataset of 7,800 Urdu tweets. Our results show that RoBERTa achieves a macro F1-score of 0.82 on our multi-class classification task, outperforming the deep learning and machine learning baseline models.
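
For readers unfamiliar with the metric, the macro F1-score reported above is the unweighted mean of the per-class F1 scores, so each class counts equally regardless of how many tweets it contains. The following is a minimal sketch of that computation in plain Python. The function name `macro_f1`, the class names, and the example labels are hypothetical and chosen for illustration only; they are not taken from the paper's dataset or code.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class received a false positive
            fn[truth] += 1   # true class received a false negative
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical three-class example (labels are assumptions, not paper data).
y_true = ["hate", "offensive", "neutral", "neutral", "hate", "offensive"]
y_pred = ["hate", "neutral", "neutral", "neutral", "offensive", "offensive"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}")
```

Because the average is unweighted, a model that performs poorly on a rare class is penalized as heavily as one that performs poorly on a common class, which is why the metric is often preferred for imbalanced tasks such as hate speech detection.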

Full Text