Comparative Performance of Machine Learning Algorithms in Detecting Offensive Speech in Malayalam-English Code-Mixed Data

L K Dhanya, Kannan Balakrishnan

doi:10.1007/978-981-19-1018-0_59

Abstract

Identifying offensive speech in social media communication has become a high priority for avoiding confrontations and curtailing unwanted behaviour. Hate speech identification becomes difficult when multilingual speakers switch between languages, which makes algorithms built for monolingual corpora inadequate. As part of our research, we undertake a comparative analysis of hate speech detection in code-mixed social media text. In this article, we created a standard Malayalam-English code-mixed dataset that can be used to detect hate speech and abuse. We used five machine learning algorithms in this study (support vector machine, logistic regression, K-nearest neighbour, random forest and XGBoost) and tuned the hyper-parameters of all models. The study concludes by evaluating these five models on several performance metrics to find the best-performing model. XGBoost achieved good results, with 80% accuracy and high precision, recall and F1-score.

Keywords

Offensive speech, Machine learning, XGBoost, Grid search, SMOTE, SVM, Random forest, KNN, Logistic regression
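The metrics named above (accuracy, precision, recall and F1-score) can be illustrated with a minimal sketch for a binary offensive/not-offensive task. The function name `binary_metrics` and the toy labels are assumptions made for illustration only; they are not the authors' code, data or results.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for binary labels.

    Label 1 is treated as the positive ("offensive") class.
    Hypothetical helper for illustration, not taken from the paper.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard every ratio against a zero denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, invented for this example (1 = offensive, 0 = not offensive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

Reporting precision and recall alongside accuracy matters when one class is much rarer than the other, since a model can reach high accuracy while missing most positive examples.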

Full Text