A Transformer Based Approach for Abuse Detection in Code Mixed Indic Languages.

Vibhuti Bansal, Vedika Gupta, Qin Xin, Mrinal Tyagi, Rajesh Sharma

doi:10.1145/3571818

Abstract

The growing number of online social media platforms has drawn active participation from web users across the globe. This has also led to a rise in cyberbullying cases online. Such incidents damage an individual’s reputation or defame a community, and also pose a threat to the privacy of users in cyberspace. Traditionally, manual checks and handling mechanisms have been used to deal with such textual content. However, an automatic computer-based approach would provide far better solutions to this problem. Existing approaches to automating this task mostly involve classical machine learning models, which tend to perform poorly on low-resource languages. Owing to the varied backgrounds and languages of web users, cyberspace contains a great deal of multilingual text. An integrated approach that accommodates multilingual text could be an appropriate solution. This paper explores various methods to detect abusive content in 13 Indic code-mixed languages. Firstly, baseline classical machine learning models are compared with Transformer-based architectures. Secondly, the paper presents an experimental analysis of four state-of-the-art Transformer-based models, namely XLM-RoBERTa, indic-BERT, MurilBert and mBERT, of which XLM-RoBERTa with BiGRU performs best. Thirdly, the best-performing model, XLM-RoBERTa, is fed with emoji embeddings, which further improves its overall performance. Finally, the model is trained on the combined dataset of 13 Indic languages to compare its performance with that of the individual language models. The combined model surpassed the individual models in terms of F1 score and accuracy, suggesting that the combined model fits the data better, possibly due to its code-mixed nature. This model reports an F1 score of 0.88 on test data, with a training loss of 0.28, a validation loss of 0.31 and an AUC score of 0.94 for both training and validation.
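The abstract compares models by F1 score and accuracy. As a minimal sketch of how these two metrics relate, the Python snippet below computes them from binary predictions. It assumes a two-class setup (1 = abusive, 0 = not abusive) and a positive-class F1; the abstract does not state which F1 variant (binary, macro or weighted) the authors report. The function name `binary_metrics` and the toy labels are hypothetical, introduced only for illustration.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for labels 1 (abusive) and 0 (not abusive).

    Assumption: F1 is computed for the positive (abusive) class only.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Confusion-matrix counts
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels invented for illustration; they are not data from the paper.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

Accuracy counts all correct predictions equally, whereas F1 ignores true negatives, so the two can diverge sharply when abusive posts are a small share of the data. That is one reason to report both.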

Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Publication Date: Nov 23, 2022
Citations: 7
