Impact of SMOTE on Imbalanced Text Features for Toxic Comments Classification Using RVVC Model

Vaibhav Rupapara, Gyu Sang Choi, Hina Fatima Shahzad, Furqan Rustam, Arif Mehmood, Imran Ashraf

doi:10.1109/access.2021.3083638

Open Access

Abstract

Social media platforms and microblogging websites have grown rapidly in popularity over the past few years. These platforms are used for expressing views and opinions about products, personalities, and events. Discussions and debates on these platforms often turn into fights that involve rude, disrespectful, and hateful comments, known as toxic comments. Identifying toxic comments is therefore regarded as an essential task for social media platforms. This study introduces an ensemble approach, called regression vector voting classifier (RVVC), to identify toxic comments on social media platforms. The ensemble combines logistic regression and a support vector classifier under a soft voting criterion. Several experiments are performed on both the imbalanced and the balanced dataset to analyze the performance of the proposed approach. To balance the data, the synthetic minority oversampling technique (SMOTE) is applied to the imbalanced dataset. Furthermore, two feature extraction approaches, term frequency-inverse document frequency (TF-IDF) and bag-of-words (BoW), are evaluated for their suitability. The performance of the proposed approach is compared with several machine learning classifiers using accuracy, precision, recall, and F1-score. Results suggest that RVVC outperforms all individual models when TF-IDF features are used with the SMOTE-balanced dataset, achieving an accuracy of 0.97.
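
The soft-voting combination of logistic regression and a support vector classifier on TF-IDF features can be sketched as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the toy comments, the linear kernel, and every hyperparameter are assumptions made for the example, and the abstract does not state which library or settings were used.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy comments invented for illustration only (1 = toxic, 0 = non-toxic).
texts = [
    "you are an idiot", "shut up you fool", "what a stupid idea",
    "nobody likes you loser", "you are a worthless idiot",
    "thanks for the help", "great point, well explained",
    "I agree with this edit", "please add a source here",
    "nice work on the article",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Soft voting averages the class probabilities of the two base models,
# so the SVC must expose probabilities (probability=True).
rvvc_sketch = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("svc", SVC(kernel="linear", probability=True, random_state=0)),
        ],
        voting="soft",
    ),
)
rvvc_sketch.fit(texts, labels)

new_comments = ["you stupid fool", "thanks, nice explanation"]
proba = rvvc_sketch.predict_proba(new_comments)  # averaged probabilities
pred = rvvc_sketch.predict(new_comments)         # class with highest average
```

Replacing `TfidfVectorizer` with `CountVectorizer` would give the bag-of-words variant that the abstract compares against.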

Highlights

Social media platforms and microblogging websites have gained accelerated popularity for social communication between individuals and groups
Results suggest that the regression vector voting classifier (RVVC) gives the highest number of correct predictions when used with term frequency-inverse document frequency (TF-IDF) features extracted from the dataset over-sampled with the synthetic minority oversampling technique (SMOTE)
This study analyzes the performance of various machine learning models for toxic comments classification and proposes an ensemble approach called RVVC
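
The over-sampling step mentioned in the highlights can be illustrated with a small pure-Python sketch of the core SMOTE idea: each synthetic minority sample is placed on the line segment between a minority sample and one of its nearest minority-class neighbours. The function name `smote_sketch`, its parameters, and the toy feature vectors are hypothetical and chosen only for this example; the paper's actual settings are not given here.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy 2-D feature vectors standing in for minority-class TF-IDF rows.
minority = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2]]
new_points = smote_sketch(minority, n_new=4)
```

For real feature matrices, the `SMOTE` class in the imbalanced-learn package (`imblearn.over_sampling.SMOTE`, used through `fit_resample`) implements the same technique.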

Summary

Introduction

Social media platforms and microblogging websites have gained accelerated popularity for social communication between individuals and groups. Through these platforms, people share their thoughts, ideas, and opinions and express their feelings using comments and feedback [1]. Text in online comments contains many hazards, such as fake news, cyberbullying, online harassment, and toxicity [4]. Toxic comments have become a serious issue that affects the reputation of social platforms and causes various psychological problems for users, such as depression, frustration, and even suicidal thoughts [1].

Journal: IEEE Access
Publication Date: Jan 1, 2021
Citations: 116
License type: CC BY 4.0