Data-driven Approach for Selection of an Ensemble Model of Profane Words Detection in Social Media

Abdulrehman A Mohamed, Michael Kimwele, George Okeyo

doi:10.1109/icabcd51485.2021.9519372

Abstract

Social media has become a mainstream tool for social engagement, entertainment, and even news sharing. As a result, many people subscribe to one or more social media platforms. As on any social platform with diverse members, there is a need for clean conversation devoid of bullying, profanity, and related social ills. Social media platforms require scalable, robust, and accurate techniques for detecting profane words in order to minimize their use and publication. Despite efforts to develop techniques for profane word detection, poor accuracy remains a challenge. To address this challenge, this paper develops a data-driven approach for algorithm selection, herein called the Data-driven Ensemble Model (DEM). The approach consists of algorithm tuning, extreme feature engineering, crowd-sourcing analytics, and ensemble method selection. Raw data from the Twitter social media platform was used to censor profane words, and the WEKA data mining software was used to conduct the evaluation experiments. The experiments covered data preprocessing, crowd-sourcing evaluation, ensemble method selection, algorithm tuning, and model building and performance evaluation. The results showed that the Data-driven Ensemble Model (DEM) achieved a higher average accuracy (94.94%) than the baseline method, Support Vector Machines (SVM), which achieved 93.33% on three profane-word datasets.
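
The general idea of choosing between an ensemble and its individual members by measured accuracy can be illustrated with a minimal Python sketch. This is not the authors' DEM pipeline and does not use WEKA: the labels, the three sets of model predictions, and the function names (`accuracy`, `majority_vote`, `select_best`) are all invented for illustration, and simple majority voting stands in for whichever ensemble method the paper selects.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def majority_vote(per_model_predictions):
    """Combine several models' label lists by majority vote per sample."""
    combined = []
    for votes in zip(*per_model_predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def select_best(candidates, labels):
    """Score every candidate on the same labels and return the most accurate."""
    scores = {name: accuracy(preds, labels) for name, preds in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy validation labels (1 = profane, 0 = clean); hypothetical data only.
labels = [1, 0, 1, 1, 0, 0, 1, 0]

# Hypothetical predictions from three base classifiers.
candidates = {
    "model_a": [1, 0, 0, 1, 0, 1, 1, 0],
    "model_b": [1, 1, 1, 1, 0, 0, 0, 0],
    "model_c": [0, 0, 1, 1, 1, 0, 1, 0],
}

# Add the ensemble as one more candidate, then let the data pick the winner.
candidates["majority_vote"] = majority_vote(list(candidates.values()))
best, scores = select_best(candidates, labels)

for name, score in scores.items():
    print(f"{name}: {score:.2%}")
print("selected:", best)
```

In this toy setup each base model errs on different samples, so the vote corrects every individual mistake. Real classifiers often make correlated errors, and the gain from an ensemble is usually much smaller, as the 94.94% versus 93.33% figures reported above suggest.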

Full Text