Sinhala Hate Speech Detection in Social Media using Text Mining and Machine learning

H.M.S.T Sandaruwan, S.A.S Lorensuhewa, M.A.L Kalyani

doi:10.1109/icter48817.2019.9023655

Abstract

With the rapid growth of information technology and computer science, communicating and presenting ideologies has become easier than in earlier decades. Because social media are available globally through the web, anyone can easily target a person or group belonging to a different culture or holding a different belief. Although everyone has the right to express their own ideas, that expression should not be harmful, since everyone also has the right to be protected from hate speech of any kind. Social media currently lack automatic methods for detecting hate speech, so anyone can easily be targeted. Because social media service providers lack sufficient linguistic knowledge of some languages, such as Sinhala, they may take a couple of days to remove hate-related comments once they notice them. Hate speech detection in the Sinhala language is therefore an urgent and important problem. We propose lexicon-based and machine learning-based approaches to automatically detect Sinhala hate and offensive speech shared through social media. In our study, the lexicon-based approach began with a lexicon generation process, and the corpus-based lexicon achieved 76.3% accuracy for hate, offensive, and neutral speech detection. The machine learning approach began with building a corpus of 3000 comments distributed evenly among hate, offensive, and neutral speech. Using this comment corpus, we identified the best-fitting feature groups and models for Sinhala hate speech detection. In our experiments, character trigrams with Multinomial Naive Bayes gave the highest recall, 0.84, with 92.33% accuracy.
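
To make the best-performing configuration concrete, the following is a minimal sketch of a character-trigram Multinomial Naive Bayes classifier in plain Python. It is not the authors' implementation: the class name `TrigramNB`, the helper `char_trigrams`, the add-one smoothing, and the six toy training strings (made-up Latin-script placeholders standing in for Sinhala comments) are all assumptions made for illustration, and the sketch says nothing about the reported accuracy or recall.

```python
import math
from collections import Counter, defaultdict


def char_trigrams(text):
    """Return the overlapping character trigrams of a string."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


class TrigramNB:
    """Multinomial Naive Bayes over character-trigram counts (add-one smoothing)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.feature_counts = defaultdict(Counter)   # trigram counts per class
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(char_trigrams(text))
        self.vocab = {g for c in self.feature_counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        # Trigrams never seen in training are ignored.
        grams = [g for g in char_trigrams(text) if g in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[g] + 1) / denom) for g in grams)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus: placeholder tokens, two comments per class.
texts = [
    "xhate xhate target", "xhate group target",
    "yrude yrude fool", "yrude silly fool",
    "nice day today", "good morning friend",
]
labels = ["hate", "hate", "offensive", "offensive", "neutral", "neutral"]

model = TrigramNB().fit(texts, labels)
print(model.predict("xhate target"))
```

Character n-grams need no tokenizer or stemmer, which is one plausible reason they suit a language with limited processing tools; that reading is an interpretation, not a claim made in the abstract.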

Full Text