Abstract

Freedom of speech for the people of Indonesia on social media makes the spread of hate speech and abusive language inevitable. If there is no proper handling, this will lead to social disharmony between individuals and communities. The identification of hate speech and abusive language on Twitter in the Indonesian language is quite challenging. Because of its ability to understand the meaning of a sentence, semantic features such as word embedding can be relied on to understand tweets that contain hateful and abusive words. In this study, word embedding (word2vec) feature and its combinations with part of speech and/or emoji were used to identify hate speech and abusive language on Twitter in the Indonesian language. Furthermore, some combinations of unigram with part of speech and/or emojis were also utilized during the experiment and the results were studied. The classification algorithms used in this study were Support Vector Machine, Random Forest Decision Tree, and Logistic Regression. The combination of unigram features, part of speech and emoji obtained the highest accuracy value of 79.85% with F-Measure of 87.51%.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call