Feature Expansion Using Word2vec for Hate Speech Detection on Indonesian Twitter with Classification Using SVM and Random Forest

Mila Putri Kartika Dewi, Erwin Budi Setiawan

doi:10.30865/mib.v6i2.3855

Open Access

Journal: JURNAL MEDIA INFORMATIKA BUDIDARMA
Publication Date: Apr 25, 2022
Citations: 2
License type: CC BY 4.0

Affiliation: Telkom University

Abstract

Hate speech is one of the most common problems on Twitter. Tweets are limited to 280 characters, which leads to many word variations and possible vocabulary mismatches. This study therefore aims to address these problems by building a hate speech detection system for Indonesian-language Twitter. It uses a dataset of 20,571 tweets and applies the Feature Expansion method with Word2vec to overcome vocabulary mismatches. Bag of Words (BOW) and Term Frequency-Inverse Document Frequency (TF-IDF) are used to represent feature values in tweets. Two classification methods are examined: Support Vector Machine (SVM) and Random Forest (RF). The results show that Feature Expansion with TF-IDF weighting and Random Forest classification gives the best accuracy, 88.37%. Across several tests, Feature Expansion with TF-IDF weighting increased accuracy in detecting hate speech and helped overcome vocabulary mismatches.
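
The abstract does not spell out how Feature Expansion works, so the following is only a minimal sketch of one common reading: when a vocabulary term has a zero value for a tweet, a similar word that does occur in the tweet supplies the value instead. The `SIMILAR` table, the toy vocabulary, and the function names are all hypothetical; in practice the similar-word lists would come from a trained Word2vec model, and the values could be TF-IDF weights rather than the raw counts used here.

```python
from collections import Counter

# Hypothetical similarity table standing in for Word2vec output
# (for example, the top-N most similar words per vocabulary term).
# These toy entries are assumptions, not data from the study.
SIMILAR = {
    "benci": ["muak", "sebel"],
    "bodoh": ["goblok", "tolol"],
}


def bow_vector(tokens, vocab):
    """Plain Bag-of-Words counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def expand_features(tokens, vocab, similar):
    """Bag-of-Words counts where zero entries are filled from similar words."""
    counts = Counter(tokens)
    vector = []
    for term in vocab:
        value = counts[term]
        if value == 0:
            # Vocabulary mismatch: borrow the count of the first
            # similar word that actually appears in the tweet.
            for candidate in similar.get(term, []):
                if counts[candidate] > 0:
                    value = counts[candidate]
                    break
        vector.append(value)
    return vector


vocab = ["benci", "bodoh", "kamu"]
tweet = "aku muak sama kamu goblok".split()

plain = bow_vector(tweet, vocab)                    # only "kamu" matches
expanded = expand_features(tweet, vocab, SIMILAR)   # mismatches filled in
```

In this toy case the plain vector matches only one vocabulary term, while the expanded vector also picks up the two terms whose near-synonyms appear in the tweet. The expanded vectors would then be passed to a classifier such as SVM or Random Forest.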

Full Text