Abstract
The rise of hate speech in Bangladesh, driven by a rapidly expanding social media user base, has adversely affected cybersecurity and online safety. Most hate speech detection systems focus on widely used languages and overlook low-resource ones such as Bangla. This study addresses that gap by detecting hate speech in Bangla using word embedding techniques (Word2Vec, GloVe, and FastText) together with machine learning algorithms (Multinomial Naive Bayes, Random Forest, K-Nearest Neighbours, and Extreme Gradient Boosting). The models were trained and tested on a dataset of 30,000 annotated messages covering both positive and hate speech. FastText combined with a hybrid of Random Forest (RF) and Support Vector Machine (SVM) achieved the best performance, with an accuracy of 96.03%. The system also identified hate speech not included in the training data, which indicates that it generalises beyond the training set. Improving the detection of harmful content in Bangla is important for Bangladesh's growing digital ecosystem and contributes to cybersecurity. This work lays a foundation for future research on Bangla hate speech detection with word embeddings and machine learning, and could support the development of safer digital environments.
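To illustrate the embedding step, the following is a minimal sketch of how a FastText-style subword representation can turn a message into a fixed-length feature vector for a downstream classifier. It is not the authors' implementation: the names `char_ngrams`, `word_vector`, and `message_vector`, the sizes `DIM` and `BUCKETS`, and the MD5-based bucketing are all assumptions made for illustration, and the random table stands in for embeddings that would normally be learned from a corpus.

```python
import hashlib
import random

DIM = 16        # embedding size (assumed, for illustration only)
BUCKETS = 1000  # number of hashed n-gram buckets (assumed)

def char_ngrams(word, n_min=3, n_max=5):
    """Return character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

def bucket(ngram):
    """Map an n-gram to a stable bucket index (MD5 chosen for simplicity)."""
    digest = hashlib.md5(ngram.encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS

# Randomly initialised vectors stand in for trained subword embeddings.
_rng = random.Random(0)
TABLE = [[_rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]

def word_vector(word):
    """Average the bucket vectors of a word's character n-grams."""
    grams = char_ngrams(word) or [f"<{word}>"]
    rows = [TABLE[bucket(g)] for g in grams]
    return [sum(col) / len(rows) for col in zip(*rows)]

def message_vector(message):
    """Average word vectors over a whitespace-tokenised message."""
    words = message.split()
    if not words:
        return [0.0] * DIM
    vectors = [word_vector(w) for w in words]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

features = message_vector("আমি ভালো আছি")
print(len(features))
```

Because vectors are built from character n-grams rather than whole words, a misspelt or previously unseen word still receives a representation, which is one plausible reason subword embeddings suit noisy social media text. In practice the resulting vectors would be passed to a classifier such as Random Forest or SVM.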