Identification of hate speech and abusive language on Indonesian Twitter using the Word2vec, part of speech and emoji features

Muhammad Okky Ibrohim, Muhammad Akbar Setiadi, Indra Budi

doi:10.1145/3373477.3373495

Abstract

Freedom of speech on social media in Indonesia makes the spread of hate speech and abusive language inevitable. Without proper handling, this can lead to social disharmony between individuals and communities. Identifying hate speech and abusive language in Indonesian-language tweets is challenging. Because semantic features such as word embeddings can capture the meaning of a sentence, they can be relied on to interpret tweets that contain hateful and abusive words. In this study, the word embedding (word2vec) feature and its combinations with part-of-speech and/or emoji features were used to identify hate speech and abusive language in Indonesian-language tweets. Combinations of unigram features with part-of-speech and/or emoji features were also evaluated. The classification algorithms used were Support Vector Machine, Random Forest Decision Tree, and Logistic Regression. The combination of unigram, part-of-speech, and emoji features obtained the highest accuracy, 79.85%, with an F-measure of 87.51%.
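
To make the idea of combining unigram and emoji features concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the tokenizer, the emoji code-point ranges, the helper names (`unigram_counts`, `emoji_counts`, `build_vocab`, `vectorize`), and the two example tweets are all assumptions made for illustration. Part-of-speech features are omitted because they require a tagger, and the classification step is not shown.

```python
import re
from collections import Counter

# Rough emoji matcher: covers a few common emoji code-point blocks only
# (an assumption for illustration, not a complete emoji definition).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Very simple tokenizer: lowercase runs of letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def unigram_counts(text):
    """Count unigram (single-word) occurrences in one tweet."""
    return Counter(TOKEN_RE.findall(text.lower()))


def emoji_counts(text):
    """Count occurrences of each emoji character in one tweet."""
    return Counter(EMOJI_RE.findall(text))


def build_vocab(tweets, extractor):
    """Map every item produced by `extractor` over the corpus to an index."""
    items = sorted({item for tweet in tweets for item in extractor(tweet)})
    return {item: index for index, item in enumerate(items)}


def vectorize(text, word_vocab, emoji_vocab):
    """Concatenate unigram counts and emoji counts into one feature vector."""
    vector = [0] * (len(word_vocab) + len(emoji_vocab))
    for word, count in unigram_counts(text).items():
        if word in word_vocab:
            vector[word_vocab[word]] = count
    offset = len(word_vocab)  # emoji features follow the unigram features
    for emoji, count in emoji_counts(text).items():
        if emoji in emoji_vocab:
            vector[offset + emoji_vocab[emoji]] = count
    return vector


# Made-up example tweets (not taken from the paper's dataset).
tweets = [
    "kamu bodoh \U0001F621\U0001F621",
    "selamat pagi \U0001F60A",
]
word_vocab = build_vocab(tweets, unigram_counts)
emoji_vocab = build_vocab(tweets, emoji_counts)
features = [vectorize(tweet, word_vocab, emoji_vocab) for tweet in tweets]
```

Each row of `features` could then be passed, together with a label, to any of the classifiers named in the abstract. Keeping the emoji counts in a separate block of the vector makes it easy to switch that feature group on or off when comparing combinations.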

Full Text