The Impact of Data Pre-Processing on Hate Speech Detection in a Mix of English and Hindi–English (Code-Mixed) Tweets

Khalil Al-Hussaeni, Mohamed Sameer, Ioannis Karamitsos

doi:10.3390/app131911104

Abstract

Due to the increasing reliance on social network platforms in recent years, hate speech has risen significantly among online users. Governments and social media platforms face the challenging responsibility of detecting, controlling, and removing rapidly growing hateful content as early as possible to prevent future criminal acts, such as cyberviolence and real-life hate crimes. Twitter is used globally by people from various backgrounds and nationalities; it contains tweets posted in different languages, including code-mixed language, such as Hindi–English. Because tweets are informal and vary in spelling and grammar, hate speech detection is especially challenging in code-mixed text. In this paper, we tackle the critical issue of hate speech detection on social media, with a focus on a mix of English and Hindi–English (code-mixed) text messages on Twitter. More specifically, we aim to evaluate the impact of data pre-processing on hate speech detection. Our method first performs a 10-step data cleansing procedure; then, it builds a detection method based on two architectures, namely a convolutional neural network (CNN) and a combination of CNN and long short-term memory (LSTM). We tune the hyperparameters of the proposed model architectures and conduct extensive experimental analysis on real-life tweets to evaluate the performance of the models in terms of accuracy, efficiency, and scalability. Moreover, we compare our method with a closely related hate speech detection method from the literature. The experimental results suggest that our method achieves higher accuracy and a significantly shorter runtime. Among our best-performing models, CNN-LSTM improved accuracy by nearly 2% and decreased the runtime by almost half.
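To make the idea of tweet cleansing concrete, here is a minimal Python sketch of the kind of pipeline the abstract alludes to. The abstract does not enumerate the ten steps, so the steps below (removing URLs, mentions, the retweet marker, punctuation, and digits, plus lowercasing and whitespace normalization) are assumptions chosen for illustration, and the function name `clean_tweet` is hypothetical rather than taken from the paper.

```python
import re
import string

# Illustrative patterns only; the paper's actual 10 cleansing steps are not
# listed in the abstract, so every step here is an assumption.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
RETWEET_RE = re.compile(r"^\s*RT\b")
DIGIT_RE = re.compile(r"\d+")
# Map ASCII punctuation to spaces; other characters are left untouched.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_tweet(text: str) -> str:
    """Apply a small, assumed sequence of cleansing steps to one tweet."""
    text = URL_RE.sub(" ", text)        # drop links
    text = MENTION_RE.sub(" ", text)    # drop @user mentions
    text = HASHTAG_RE.sub(r"\1", text)  # keep the hashtag word, drop '#'
    text = RETWEET_RE.sub(" ", text)    # drop a leading retweet marker
    text = text.lower()                 # normalize case
    text = text.translate(PUNCT_TABLE)  # replace ASCII punctuation with spaces
    text = DIGIT_RE.sub(" ", text)      # drop digits
    return " ".join(text.split())       # collapse repeated whitespace


if __name__ == "__main__":
    sample = "RT @user: Yeh bahut BURA hai!!! #NoHate https://t.co/abc123 2023"
    print(clean_tweet(sample))
```

The order matters in this sketch: URLs and mentions are removed before punctuation is stripped, since otherwise fragments such as `https` or `user` would survive as ordinary tokens. The cleaned text would then be tokenized and passed to a classifier such as the CNN or CNN-LSTM models the abstract describes.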


Journal: Applied Sciences
Publication Date: Oct 9, 2023
License type: CC BY 4.0

