In today's digital era, social media has become a primary tool for communicating and sharing information, and the availability of high-speed internet lets it reach the masses much faster. A lack of regulation and ethical norms has contributed to the proliferation of abusive language and hate speech, which has become a growing concern on social media platforms in the form of posts, replies, and comments directed at individuals, groups, religions, and communities. Manually classifying hate speech on online platforms is cumbersome and impractical because of the sheer volume of data being generated. It is therefore crucial to filter online content automatically in order to identify and remove hate speech from social media. Research has been driven by widely spoken, resource-rich languages such as English, which have achieved strong results thanks to the accessibility of large corpora, annotated datasets, and tools. Resource-constrained languages have not benefited from these advances because they lack corpora and annotated datasets. India has diverse languages that vary with demographics, many of which have limited data availability and differ semantically from one another. Telugu is one of the low-resource Dravidian languages spoken in the southern part of India. In this paper, we present a monolingual Telugu corpus of tweets posted on Twitter, annotated with hate and non-hate labels, together with experiments comparing state-of-the-art fine-tuned deep learning models (mBERT, DistilBERT, IndicBERT, NLLB, MuRIL, RNN+LSTM, XLM-RoBERTa, and IndicBART). Using transfer learning and hyperparameter tuning, we compare the models on their effectiveness in classifying hate speech in Telugu text. The fine-tuned mBERT model outperformed all other fine-tuned models, achieving an accuracy of 98.2%. We also propose a deployment model for social media accounts.