DHOT-Repository and Classification of Offensive Tweets in the Hindi Language

Vikas Kumar Jha, Vinu P N, Vishnu Vijayan, Prabaharan P, Hrudya P

doi:10.1016/j.procs.2020.04.252

Open Access

Abstract

While social media gives people an online platform for expressing their views, knowledge, experiences and emotions, a major problem arises when social media interactions become a platform for abusive remarks, comments and conversations. Beyond being offensive in conversation, slurs vary in usage and can express contempt, difference of opinion and, in some cases, humor. Abusive language can be used to offend someone or to promote racism, sexism, etc. Hindi is the third most widely spoken language in the world, based on the number of speakers globally. Because it is spoken by millions of Indians with different regional influences and linguistic preferences, it has become very rich in its diversity and usage. While Hinglish (Hindi written in Roman script instead of the native Devanagari) is extensively used online, the number of native Hindi speakers who write in Devanagari is steadily rising. Despite this, little research has been done on the use of Hindi as an online language. This paper presents a fastText-based model to distinguish and classify offensive and non-offensive text. The model was able to classify text from a Devanagari Hindi Offensive Tweets (DHOT) data corpus. A grid-search method was applied to tune hyperparameters during fastText model runs, and provided interesting insights into the model's accuracy and precision. Our fastText model achieved 92.2% accuracy using a desktop-class machine for processing. To our knowledge, this is the first attempt to establish state-of-the-art classification of offensive text in Hindi using fastText models.
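The grid search mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which hyperparameters or value ranges the authors searched, so the grid below (`lr`, `epoch`, `wordNgrams`) and the helper names `grid_search` and `toy_score` are assumptions chosen for illustration only. The scoring function is a deterministic stand-in so the example runs without any data; in practice it would train a model with `fasttext.train_supervised(...)` on a training file and return the precision reported by `model.test(...)` on a validation file.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Params = Dict[str, float]


def grid_search(
    param_grid: Dict[str, Sequence[float]],
    evaluate: Callable[[Params], float],
) -> Tuple[Params, float, List[Tuple[float, Params]]]:
    """Score every combination in param_grid and return the best one."""
    names = sorted(param_grid)
    results: List[Tuple[float, Params]] = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((evaluate(params), params))
    # max() keeps the first combination seen if several share the top score.
    best_score, best_params = max(results, key=lambda item: item[0])
    return best_params, best_score, results


def toy_score(params: Params) -> float:
    """Hypothetical stand-in for 'train a model, return validation score'.

    Not a real model: it simply peaks at lr=0.5, epoch=25, wordNgrams=2.
    """
    return (
        -abs(params["lr"] - 0.5)
        - abs(params["epoch"] - 25) / 100
        + 0.01 * params["wordNgrams"]
    )


if __name__ == "__main__":
    # Assumed grid; the paper's actual search space is not given in the abstract.
    grid = {"lr": [0.1, 0.5, 1.0], "epoch": [5, 25], "wordNgrams": [1, 2]}
    best_params, best_score, results = grid_search(grid, toy_score)
    print(len(results), "combinations tried; best:", best_params)
```

An exhaustive grid is simple and reproducible, but its cost grows multiplicatively with each added hyperparameter, which matters when every evaluation means a full training run on a desktop-class machine.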