Abstract

The growing number of online social media platforms has drawn active participation from web users worldwide. It has also led to a rise in cyberbullying cases online. Such incidents damage an individual’s reputation or defame a community, and they threaten the privacy of users in cyberspace. Traditionally, such textual content has been handled through manual checks and moderation. An automated, computer-based approach would address this problem far more effectively. Existing approaches to automating this task mostly rely on classical machine learning models, which tend to perform poorly on low-resource languages. Because web users come from varied backgrounds and speak different languages, text in cyberspace is often multilingual, so an integrated approach that accommodates multilingual text could be an appropriate solution. This paper explores various methods to detect abusive content in 13 Indic code-mixed languages. First, baseline classical machine learning models are compared with Transformer-based architectures. Second, the paper presents an experimental analysis of four state-of-the-art Transformer-based models, namely XLM-RoBERTa, indic-BERT, MurilBert and mBERT, of which XLM-RoBERTa with a BiGRU performs best. Third, the best-performing XLM-RoBERTa setup is augmented with emoji embeddings, which further improves its overall performance. Finally, the model is trained on the combined dataset of 13 Indic languages, and its performance is compared with that of the individual language models. The combined model surpasses the individual models in terms of F1 score and accuracy, suggesting that it fits the data better, possibly because of the code-mixed nature of the data. This model reports an F1 score of 0.88 on test data, with a training loss of 0.28, a validation loss of 0.31 and an AUC score of 0.94 for both training and validation.
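
For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how F1 score and accuracy can be computed for a binary abusive/non-abusive classifier. It assumes binary labels with 1 meaning abusive; the abstract does not state the label scheme or the F1 averaging mode, and the function names and sample labels are hypothetical, not taken from the paper.

```python
def binary_f1(y_true, y_pred):
    """Return (precision, recall, f1), treating label 1 (assumed: abusive) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # abusive, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # abusive, missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels for illustration only; these are not results from the paper.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
precision, recall, f1 = binary_f1(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```

F1 is the harmonic mean of precision and recall, so it penalizes a model that misses abusive posts even when overall accuracy looks high on an imbalanced dataset.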
