Hate speech detection in the Indonesian language: A dataset and preliminary study

Ika Alfina, Mohamad Ivan Fanany, Yudo Ekanata, Rio Mulia

doi:10.1109/icacsis.2017.8355039

Abstract

The objective of our work is to detect hate speech in the Indonesian language. To our knowledge, research on this subject is still scarce. The only prior study we found created a dataset for hate speech against religion, but the quality of that dataset is inadequate. Our research aimed to create a new dataset that covers hate speech in general, including hatred of religion, race, ethnicity, and gender. In addition, we conducted a preliminary study using a machine learning approach, which so far is the most frequently used approach for classifying text. We compared the performance of several features and machine learning algorithms for hate speech detection. The features extracted were word n-grams with n=1 and n=2, character n-grams with n=3 and n=4, and negative sentiment. Classification was performed using Naïve Bayes, Support Vector Machine, Bayesian Logistic Regression, and Random Forest Decision Tree. An F-measure of 93.5% was achieved when using the word n-gram feature with the Random Forest Decision Tree algorithm. The results also show that the word n-gram feature outperformed the character n-gram feature.
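To make the n-gram features concrete, the following is a minimal sketch of how word n-grams (n=1, 2) and character n-grams (n=3, 4) could be counted for a single text. It is not the authors' implementation: the function names, the lowercasing, the whitespace tokenization, and the `w1:`/`c3:` key prefixes are all assumptions made for illustration, and the negative sentiment feature is not covered.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of text as space-joined strings.

    Assumption: tokens are obtained by lowercasing and splitting on whitespace.
    """
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return the character n-grams of the lowercased text, spaces included."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def extract_features(text):
    """Count word 1- and 2-grams and character 3- and 4-grams in one Counter.

    The key prefixes (w1:, w2:, c3:, c4:) are hypothetical; they only keep
    the four feature families from colliding in the same dictionary.
    """
    feats = Counter()
    for n in (1, 2):
        feats.update(f"w{n}:{g}" for g in word_ngrams(text, n))
    for n in (3, 4):
        feats.update(f"c{n}:{g}" for g in char_ngrams(text, n))
    return feats


if __name__ == "__main__":
    # Neutral placeholder sentence ("I like coffee"), not taken from the dataset.
    sample = "saya suka kopi"
    print(word_ngrams(sample, 2))
    print(char_ngrams("kopi", 3))
    print(extract_features(sample))
```

Count dictionaries of this kind would then typically be converted to fixed-length vectors before being passed to a classifier such as Naïve Bayes or a random forest.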

Full Text