A Malay Language Cyberbullying Detection Model on Twitter using Supervised Machine Learning

Nurina Farhanah Binti Johari, Juliana Jaafar

doi:10.1109/ivit55443.2022.10033395

Abstract

This research detects cyberbullying in the Malay language using supervised machine learning (ML) and Natural Language Processing (NLP). Given the high number of cyberbullying cases in Malaysia over the years, and the belief that the number of unreported cases has increased, an intelligent way to detect cyberbullying on social media is needed. This research therefore explores how supervised ML and NLP can help detect Malay-language cyberbullying incidents on social media. The dataset was collected from Twitter by scraping tweets that contain common Malay words used in cyberbullying incidents; the tweets were then labelled into six classes: appearance, intellectual, political, racial, sexual, and non-abusive. The resulting dataset is imbalanced and contains 45,580 tweets. Models were built using Logistic Regression (LR), Naïve Bayes (NB), Support Vector Machine (SVM) and Random Forest (RF) algorithms, each combined with three feature extraction techniques: Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF) and Word2Vec. The results indicate that the best model combines LR with TF-IDF feature extraction. This model was improved further by applying an oversampling technique, the Synthetic Minority Oversampling Technique (SMOTE), to address the class imbalance, and by tuning the model hyperparameters. The F-score of the optimised TF-IDF and LR model is 0.46.
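To make two of the steps named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows TF-IDF weighting of tokenised tweets and the interpolation idea behind SMOTE, using only the Python standard library. The function names `tfidf` and `smote_like_sample`, the smoothed IDF variant, the parameter `k`, and the toy tweets (neutral Malay words, not taken from the dataset) are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) of TF-IDF weights for tokenised documents.

    Assumption: term frequency is the relative count within a document and
    IDF is the smoothed form ln((1 + n) / (1 + df)) + 1.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        rows.append([counts[t] / total * idf[t] if total else 0.0 for t in vocab])
    return vocab, rows


def smote_like_sample(minority_rows, k=2, rng=None):
    """Create one synthetic minority-class vector by interpolation.

    Picks a random minority vector, finds its k nearest minority neighbours
    (Euclidean distance), and returns a random point on the line segment
    between the vector and one of those neighbours.
    Requires at least two minority rows.
    """
    rng = rng or random.Random(0)
    base = rng.choice(minority_rows)
    others = [r for r in minority_rows if r is not base]
    neighbours = sorted(others, key=lambda r: math.dist(base, r))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Toy, hypothetical tweets (already tokenised).
toy_tweets = [
    ["saya", "suka", "nasi"],
    ["saya", "suka", "teh"],
    ["dia", "suka", "teh"],
]
vocab, rows = tfidf(toy_tweets)
# Pretend the first two tweets belong to a minority class.
synthetic = smote_like_sample(rows[:2], k=1, rng=random.Random(0))
```

A term that occurs in every tweet ("suka" here) receives a lower weight than a term that occurs in only one, which is what lets a linear classifier such as LR focus on distinctive words. In a full pipeline the synthetic vectors would be added to the training split only, so that evaluation scores still reflect the original class distribution.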

Full Text