Pradvis vac: A socio-demographic dataset for determining the level of hatred severity in a low-resource Hinglish language

Shankar Biradar, Sunil Saumya, Abhinav Kumar, Ashish Singh

doi:10.1145/3573199

Abstract

In multilingual societies like India, mixing the native language with English has become common in social media conversations. Further, due to the government's digitization push, more people from rural India are joining social media platforms, resulting in exponential growth of native-language and code-mixed content. This content appears in both positive contexts (also termed Hope Speech) and negative contexts (also termed Hate Speech). To keep social media clean and free of hate, it is important to remove the negative content using machine learning filters. Since most existing hate content prediction models are trained on high-resource languages such as English, they fail on code-mixed text because of its spelling variance and non-grammatical structure. In addition, the lack of suitable training data could be one reason for the poor performance of existing models on code-mixed text. To address these issues and promote research in this direction, we developed a manually annotated Hinglish code-mixed corpus of 9254 comments taken from Twitter handles. We also annotated the data with the target audience and the severity level. For each label, we provided a more fine-grained classification with three independent classes, yielding a multi-label, multi-class corpus for predicting the severity of hate content in Hinglish code-mixed text. Further, to validate the proposed data, we modeled various supervised classifiers for severity prediction. The proposed models employ transformers for feature extraction and different machine learning and RNN (recurrent neural network) models for classification. According to the experimental results, the target label combined with embeddings from the Twitter text, classified with a BiLSTM (a variant of RNN), performed better on severity prediction, attaining an acceptable weighted F1 score.
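As an illustration of two ideas mentioned in the abstract, the sketch below shows one way a target label could be appended to a text embedding as extra features, and how a weighted F1 score is computed from per-class F1 values. It is a minimal sketch under stated assumptions, not the authors' implementation: the class names (`low`, `medium`, `high`, `individual`, `group`, `religion`), the function names, and the toy embedding values are all hypothetical and do not come from the paper.

```python
from collections import Counter


def combine_features(text_embedding, target_label, target_classes):
    """Append a one-hot encoding of the target label to a text embedding.

    Assumption: the paper's "target label combined with embeddings" is
    approximated here by simple concatenation; the actual method may differ.
    """
    one_hot = [1.0 if c == target_label else 0.0 for c in target_classes]
    return list(text_embedding) + one_hot


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical class names, for illustration only.
TARGET_CLASSES = ["individual", "group", "religion"]

features = combine_features([0.1, 0.2], "group", TARGET_CLASSES)

y_true = ["low", "low", "high", "medium"]
y_pred = ["low", "high", "high", "medium"]
score = weighted_f1(y_true, y_pred)
```

Weighting by support means that frequent severity classes contribute more to the final score than rare ones, which matters when the class distribution is imbalanced.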


Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Publication Date: Dec 7, 2022
Citations: 3

Similar Papers

Neural Machine Translation for Sinhala-English Code-Mixed Text
Archchana Kugathasan ... Sagara Sumathipala
01 Jan 2020

The Impact of Data Pre-Processing on Hate Speech Detection in a Mix of English and Hindi–English (Code-Mixed) Tweets
Khalil Al-Hussaeni ... Ioannis Karamitsos
Applied Sciences | VOL. 13
09 Oct 2023

Transformer-based approach to classify abusive content in Dravidian Code-mixed text
...
12 May 2022

An Effective Bi-LSTM Word Embedding System for Analysis and Identification of Language in Code-Mixed social Media Text in English and Roman Hindi
Shashi Shekhar ... Dilip Kumar Sharma
Computación y Sistemas | VOL. 24
09 Dec 2020
