Deep learning based sentiment analysis and offensive language identification on multilingual code-mixed data

Kogilavani Shanmugavadivel, S Neelakandan, Malliga Subramanian, T Bheema Lingaiah, Sandhiya Raja, V E Sathishkumar

doi:10.1038/s41598-022-26092-3

Open Access

https://doi.org/10.1038/s41598-022-26092-3

Abstract

Sentiment analysis is a Natural Language Processing task that involves detecting and classifying emotions in text. The emotion is directed at a specific target, such as an object, an incident, or an individual. While some tasks are concerned with detecting whether emotion is present in a text, others determine its polarity, which is classified as positive, negative, or neutral. Offensive language identification is the task of determining whether a comment contains inappropriate text that affects an individual or a group. Existing research has concentrated more on sentiment analysis and offensive language identification in monolingual datasets than in code-mixed data. Code-mixed data is formed by combining words and phrases from two or more distinct languages in a single text. Identifying emotion or offensive terms in such comments is challenging because code-mixed data is noisy. Most advances in offensive language detection and sentiment analysis have been made on monolingual data for high-resource languages. The proposed system attempts to perform both sentiment analysis and offensive language identification for low-resource code-mixed data in Tamil and English using machine learning, deep learning, and pre-trained models such as BERT, RoBERTa, and adapter-BERT. The dataset used in this work is taken from the shared task on multi-task learning at DravidianLangTech@ACL2022. Another challenge addressed by this work is the extraction of semantically meaningful information from code-mixed data using word embeddings. The results show that the adapter-BERT model achieves the best accuracy among the trained models: 65% for sentiment analysis and 79% for offensive language identification.

Full Text