Identification of Abusive Sinhala Comments in Social Media using Text Mining and Machine Learning Techniques

H M S T Sandaruwan, M A L Kalyani, S A S Lorensuhewa

doi:10.4038/icter.v13i1.7213

Open Access

https://doi.org/10.4038/icter.v13i1.7213

Abstract

With the technology revolution, most of the natural languages used around the world have entered the digital world, and people now use modern technologies such as Social Media and the Internet in their native languages. As a result, people who take excessive pride in their tradition, race, caste, religion and other social factors tend to abuse, in their native languages, those who do not belong to the same social group. Since Social Media platforms do not have centralized control, they have become a convenient place to spread such regressive ideas without being governed or monitored. The Sinhala language is now supported by most popular Social Media platforms. Although Sinhala has more than 2500 years of history, it lacks rich resources for computer-based natural language processing, so automatically detecting abusive Sinhala comments published and shared on Social Media platforms has been a very difficult task. We therefore used a corpus of 2000 Sinhala comments, evenly distributed between offensive and neutral classes, to train three models: Multinomial Naïve Bayes (MNB), Support Vector Machine (SVM), and Random Forest Decision Tree (RFDT). Features were extracted using the Bag of Words model, word n-gram model, character n-gram model, and word skip-gram model. After training, each model was tested on a corpus of 200 evenly distributed comments, and MNB showed the highest accuracy of 96.5% with 96% average recall for both the character tri-gram and character four-gram models. In addition, two lexicon-based approaches, a cross-lingual lexicon approach and a corpus-based lexicon approach, were considered for detecting abusive Sinhala comments. Of these two, the corpus-based lexicon gave the higher accuracy of 90.5%, with an average recall of 90.5%.
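To make the feature models named in the abstract concrete, the following is a minimal sketch in plain Python of character n-gram extraction, word skip-gram extraction, and a small multinomial Naïve Bayes classifier with add-one smoothing. It is not the authors' implementation: the function and class names, the smoothing constant, and the English placeholder comments are all assumptions made for illustration, and the paper's own preprocessing may differ. Note also that Python slices strings by Unicode code point, so for Sinhala text a vowel sign and its base consonant would count as separate characters here.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text` (by code point)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_skipgrams(text, k):
    """Return ordered word pairs with between 0 and `k` words skipped."""
    words = text.split()
    pairs = []
    for i in range(len(words)):
        for j in range(i + 1, min(i + k + 2, len(words))):
            pairs.append((words[i], words[j]))
    return pairs


class TinyMultinomialNB:
    """Hypothetical multinomial Naive Bayes over character n-gram counts."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n          # n-gram length (3 = character tri-gram)
        self.alpha = alpha  # additive (Laplace) smoothing constant, assumed

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(char_ngrams(text, self.n))
        self.vocab = set()
        for counts in self.feature_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for gram in char_ngrams(text, self.n):
                if gram in self.vocab:  # n-grams unseen in training are ignored
                    score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Placeholder training data (English stand-ins, not from the paper's corpus).
train_texts = ["you are an idiot", "stupid idiot fool",
               "have a nice day", "thank you very much"]
train_labels = ["offensive", "offensive", "neutral", "neutral"]

model = TinyMultinomialNB(n=3).fit(train_texts, train_labels)
prediction = model.predict("what an idiot")
```

Character n-grams need no word segmentation or stemming, which is one plausible reason they suit a language with few NLP resources; the sketch only shows the mechanics and says nothing about the accuracy figures reported above.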

Highlights

With the rapid growth of technology, traditional communication media such as newspapers, radio channels, and television have been invaded by Internet-based communication media.
According to the performance measurements, we can conclude that all the character n-gram feature groups we considered perform well at identifying abusive Sinhala comments, as they give high Precision, Recall, and F1-Score values regardless of the model trained.
When we compare the precision, recall, F1-score, and accuracy values of the Multinomial Naïve Bayes (MNB), Support Vector Machine (SVM), and Random Forest Decision Tree (RFDT) classifiers in Table XXI, it is clear that MNB outperformed the other two models for the character n-gram model.
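The measurements named in these highlights can be computed from predicted and true labels as in the sketch below. The function name and the toy label lists are hypothetical and are not results from the paper. The sketch also illustrates a general property relevant to an evenly split test set such as the 200-comment one described in the abstract: when every class has the same number of test items, macro-averaged recall equals accuracy. On that test set, an accuracy of 96.5% corresponds to 193 of 200 comments classified correctly.

```python
def evaluate(y_true, y_pred):
    """Return per-class precision/recall/F1, accuracy, and macro recall."""
    labels = sorted(set(y_true) | set(y_pred))
    report = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    macro_recall = sum(r["recall"] for r in report.values()) / len(labels)
    return report, accuracy, macro_recall


# Toy balanced example (made up for illustration): 4 items per class.
y_true = ["offensive"] * 4 + ["neutral"] * 4
y_pred = ["offensive", "offensive", "offensive", "neutral",
          "neutral", "neutral", "neutral", "neutral"]

report, accuracy, macro_recall = evaluate(y_true, y_pred)
```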

Summary

Introduction

With the rapid growth of technology, traditional communication media such as newspapers, radio channels, and television have been invaded by Internet-based communication media. Since these new Internet-based media provide person-to-person communication platforms, people who previously had no space to address the wider community have welcomed Social Media platforms. Almost all people around the world are connected by some kind of Social Media.

Recommended by Dr D.D. Karunaratna on 02 April 2020. This paper is an extended version of the paper “Sinhala Hate Speech Detection in Social Media using Text Mining and Machine learning” presented at ICTer 2019.

Journal: International Journal on Advances in ICT for Emerging Regions (ICTer)
Publication Date: Jan 29, 2020
Citations: 4
License type: cc-by
