Sinhala Hate Speech Detection in Social Media Using Machine Learning and Deep Learning

W.S.S Fernando, Ruvan Weerasinghe, E.R.A.D Bandara

doi:10.1109/icter58063.2022.10024082

Abstract

The rapid rise of information technology and computer science has made communicating and presenting beliefs easier than in previous decades. Because social media is accessible worldwide via the internet, anyone can easily target an individual or a group that adheres to a different culture or belief. While everyone has the freedom to express their own opinions, that expression should not be destructive, and everyone has the right to be free of hate speech. Because there are no automatic mechanisms for detecting hate speech on social media, anyone can be readily targeted. Because social media service providers lack extensive linguistic expertise in some languages, such as Sinhala, it may take them a few days to delete hateful comments after becoming aware of them. As a result, detecting hate speech in the Sinhala language is an urgent and crucial task. This study employed machine learning and deep learning algorithms to automatically recognize Sinhala hate speech posted on social media. Bag of words, TF-IDF, Word2Vec, and FastText were used to extract features from the comments. Logistic Regression, Multinomial Naïve Bayes, Support Vector Machine, XGBoost, and Random Forest machine learning models, together with CNN, RNN, and LSTM deep learning models, were trained on two pre-collected datasets of different sizes. The six best models were then selected and their test set performance reported. In this study, FastText with RNN achieved the highest AUC ROC, 0.71, with 70% accuracy on the test set.
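As an illustration only, two of the building blocks named in the abstract, TF-IDF weighting and the AUC ROC metric, can be sketched in plain Python. This is not the authors' implementation: the function names `tfidf_vectors` and `auc_roc`, the unsmoothed IDF formula, and the placeholder tokens are all assumptions made for this sketch, and the study's own tokenization and tooling are not described in the abstract.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized comment to a {term: weight} dict.

    Assumed weighting (one common variant, not necessarily the paper's):
    tf = count / comment length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per comment

    vectors = []
    for tokens in docs:
        counts = Counter(tokens)  # raw counts are the bag-of-words features
        total = len(tokens) or 1  # guard against empty comments
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def auc_roc(labels, scores):
    """AUC ROC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one
    (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical placeholder tokens standing in for tokenized comments.
toy_docs = [["a", "b"], ["a", "c"]]
toy_vectors = tfidf_vectors(toy_docs)

# Hypothetical labels (1 = hate, 0 = not hate) and classifier scores.
toy_auc = auc_roc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```

A term that appears in every comment (here `"a"`) gets weight zero under this IDF variant, which is why TF-IDF tends to emphasize more distinctive words than raw bag-of-words counts do. An AUC ROC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported 0.71.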

Full Text