Abstract

Cyberbullying has emerged as a pervasive concern in modern society, particularly on social media platforms. It involves the use of digital communication to intimidate, threaten, harass, or harm individuals. Given the prevalence of social media in our lives, there is an escalating need for effective methods to detect and combat cyberbullying. This paper explores the use of word embeddings and compares the effectiveness of trainable word embeddings, pre-trained word embeddings, and fine-tuned language models for multiclass cyberbullying detection. In contrast to previous research, which has largely focused on binary classification, this work centers on the more nuanced multiclass setting. Word embeddings hold significant promise because they transform words into dense numerical vectors in a high-dimensional space, capturing the semantic and syntactic relationships inherent in language and enabling machine learning (ML) algorithms to discern patterns that may signify cyberbullying. The research employs diverse techniques, including convolutional neural networks and bidirectional long short-term memory networks, alongside well-known pre-trained models such as word2vec and bidirectional encoder representations from transformers (BERT). Traditional ML algorithms such as K-nearest neighbors, Random Forest, and Naïve Bayes are also integrated to evaluate their performance against the deep learning models. The findings underscore the promise of a BERT model fine-tuned on our dataset, which yields the best results in multiclass cyberbullying detection, achieving an accuracy of 85%.
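To make the headline approach concrete, the sketch below illustrates how a BERT model can be fine-tuned for multiclass text classification with the Hugging Face transformers library. It is a minimal, generic example: the label set, hyperparameters, and example posts are illustrative assumptions, not the paper's actual configuration or dataset.

```python
# Minimal sketch: fine-tuning BERT for multiclass cyberbullying detection.
# Labels, hyperparameters, and example texts are hypothetical placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical cyberbullying categories for the multiclass setting.
LABELS = ["not_cyberbullying", "age", "ethnicity", "gender", "religion", "other"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(texts, label_ids):
    """Run one gradient step on a batch of (text, label) pairs."""
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    out = model(**enc, labels=torch.tensor(label_ids))
    out.loss.backward()        # cross-entropy over the len(LABELS) classes
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# Illustrative batch (not drawn from the paper's dataset).
loss = train_step(["example post one", "example post two"], [0, 3])
```

In practice, such a model would be trained for several epochs over the full dataset and evaluated on a held-out split; the same classification head pattern applies whether the encoder is BERT or another pre-trained transformer.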
