A Robust and Linguistically-Aware Hate Speech Detection System for Roman Urdu

Abstract

Social media sites have become a common space for individuals to share their concerns and opinions. Because users can remain anonymous and communicate without restriction, individuals and organizations can engage in online behavior that breaches accepted social norms, which increases the volume and intensity of hate speech online. Urdu is the national language of Pakistan and is also widely spoken in several other countries, with over 170 million speakers worldwide. This research addresses the detection of hate speech in Roman Urdu, which is prevalent in Asia and has far fewer resources for mitigating hate speech than English. Leveraging machine learning, deep learning, ensemble learning, and natural language processing, we developed a system that models Roman Urdu language and culture and identifies diverse manifestations of hate speech, such as abusive language, religious hate, sexism, and racism. We expanded the Roman Urdu Hate Speech and Offensive Language Detection dataset to 30,955 instances and added a novel “Racism” category. The dataset covers several classes of hate speech (abusive/offensive, religious hate, sexism, and racism), each reflecting distinct patterns of discriminatory language prevalent in Roman Urdu. After text pre-processing, we extracted features using Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF) representations. For model building, we employed several supervised machine learning algorithms, including Random Forest, Decision Tree, Multinomial Naive Bayes, Support Vector Machine, and ensemble methods, validated with K-Fold cross-validation. Unsupervised learning techniques such as the Gaussian Mixture Model and k-means clustering were also implemented. Deep learning approaches, including Bidirectional Encoder Representations from Transformers (BERT), Convolutional Neural Networks, Long Short-Term Memory networks, and multilingual BERT (mBERT), were explored. Among these, mBERT stood out with an accuracy of 92%, notably surpassing the baseline performance.
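To make the TF-IDF feature extraction step mentioned in the abstract concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the whitespace tokenizer, the unsmoothed weighting `tf * log(N / df)`, and the three example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed pre-processing: lowercase and split on whitespace only.
    return text.lower().split()


def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Weight = relative term frequency * log(N / document frequency).
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up Roman Urdu sentences, used only to exercise the function.
docs = [
    "yeh film bohat achi hai",
    "yeh log bohat bure hain",
    "film achi nahi hai",
]
vectors = tfidf(docs)
```

Terms that occur in a single document (such as `log` in the second sentence) receive a higher weight than terms shared across documents (such as `yeh`), which is the property a downstream classifier relies on.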

Similar Papers
  • Research Article
  • Citations: 2
  • 10.1038/s41598-024-79106-7
Roman urdu hate speech detection using hybrid machine learning models and hyperparameter optimization.
  • Nov 19, 2024
  • Scientific reports
  • Waqar Ashiq + 7 more

With the rapid increase in social media users, cyberbullying and hate speech problems have arisen over the past years. Automatic hate speech detection (HSD) from text is an emerging research problem in natural language processing (NLP). Researchers have developed various approaches to the automatic hate speech detection problem using different corpora in various languages; however, research on the Urdu language is rather scarce. This study aims to address the HSD task on Twitter using Roman Urdu text. The contribution of this research is the development of a hybrid model for Roman Urdu HSD, which has not been previously explored. The novel hybrid model integrates deep learning (DL) and transformer models for automatic feature extraction, combined with machine learning algorithms (MLAs) for classification. To further enhance model performance, we employ several hyperparameter optimization (HPO) techniques, including Grid Search (GS), Randomized Search (RS), and Bayesian Optimization with Gaussian Processes (BOGP). Evaluation is carried out on two publicly available benchmark Roman Urdu corpora: the HS-RU-20 corpus and the RUHSOLD hate speech corpus. Results demonstrate that the Multilingual BERT (MBERT) feature learner, paired with a Support Vector Machine (SVM) classifier and optimized using RS, achieves state-of-the-art performance. On the HS-RU-20 corpus, this model attained an accuracy of 0.93 and an F1 score of 0.95 for the Neutral-Hostile classification task, and an accuracy of 0.89 with an F1 score of 0.88 for the Hate Speech-Offensive task. On the RUHSOLD corpus, the same model achieved an accuracy of 0.95 and an F1 score of 0.94 for the Coarse-grained task, alongside an accuracy of 0.87 and an F1 score of 0.84 for the Fine-grained task. These results demonstrate the effectiveness of our hybrid approach for Roman Urdu hate speech detection.
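The Randomized Search (RS) strategy named in this abstract can be sketched in a few lines of pure Python. This is an illustration of the general technique, not the paper's code: `toy_objective` is a hypothetical stand-in for a cross-validated classifier score, and the `C` and `gamma` grids are assumed values chosen only to resemble typical SVM hyperparameters.

```python
import math
import random


def random_search(objective, space, n_trials, seed=0):
    """Sample n_trials random configurations and keep the best-scoring one."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter, independently and uniformly.
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    # Hypothetical score surface that peaks at C=10, gamma=0.01.
    # A real objective would train and cross-validate a classifier.
    return (-abs(math.log10(params["C"]) - 1)
            - abs(math.log10(params["gamma"]) + 2))


# Assumed search space, for illustration only.
space = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
best_params, best_score = random_search(toy_objective, space, n_trials=20)
```

Unlike grid search, the number of evaluations is set by `n_trials` rather than by the size of the grid, which is why random search is often preferred when each evaluation is expensive.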

  • Research Article
  • Citations: 25
  • 10.1109/access.2022.3216375
Context-Aware Deep Learning Model for Detection of Roman Urdu Hate Speech on Social Media Platform
  • Jan 1, 2022
  • IEEE Access
  • Muhammad Bilal + 3 more

Over the last two decades, social media platforms have grown dramatically. Twitter and Facebook are the two most popular social media platforms, with millions of active users posting billions of messages daily. These platforms allow users to have freedom of expression. However, some users exploit this facility by disseminating hate speech. Manual detection and censorship of such hate speech are impractical; thus, an automatic detection mechanism is required to detect and counter hate speech in a real-time environment. Most research in hate speech detection has been carried out in the English language, and minimal work has been done in other languages, particularly Urdu written in Roman script. A few studies have attempted machine learning and deep learning models for Roman Urdu hate speech detection; however, due to the scarcity of Roman Urdu resources and the lack of a large corpus with defined annotation rules, a robust hate speech detection model is still required. With this motivation, this study contributes in the following manner. First, we developed annotation guidelines for Roman Urdu Hate Speech. Second, we constructed a new Roman Urdu Hate Speech Dataset (RU-HSD-30K) that was annotated by a team of experts using the annotation rules. To the best of our knowledge, the Bi-LSTM model with an attention layer for Roman Urdu Hate Speech Detection has not been explored. Therefore, we developed a context-aware Roman Urdu Hate Speech detection model based on Bi-LSTM with an attention layer and used custom word2vec for word embeddings. Finally, we examined the effect of lexical normalization of Roman Urdu words on the performance of the proposed model. Different traditional as well as deep learning models, including LSTM and CNN models, were used as baseline models. The performance of the models was assessed in terms of evaluation metrics such as accuracy, precision, recall, and F1-score. The generalization of each model was also evaluated on a cross-domain dataset. Experimental results revealed that Bi-LSTM with attention outperformed the traditional machine learning models and other deep learning models with an accuracy score of 0.875 and an F-Score of 0.885. In addition, the results demonstrated that our suggested model (Bi-LSTM with Attention Layer) generalizes better than previous models when applied to unseen data. The results confirmed that lexical normalization of Roman Urdu words enhanced the performance of the suggested model.

  • Research Article
  • Citations: 34
  • 10.1109/access.2020.3030885
A Precisely Xtreme-Multi Channel Hybrid Approach for Roman Urdu Sentiment Analysis
  • Jan 1, 2020
  • IEEE Access
  • Faiza Mehmood + 5 more

In order to accelerate the performance of various Natural Language Processing tasks for Roman Urdu, this article for the very first time provides 3 neural word embeddings prepared using most widely used approaches namely Word2vec, FastText, and Glove. The integrity of generated neural word embeddings is evaluated using intrinsic and extrinsic evaluation approaches. Considering the lack of publicly available benchmark datasets, it provides a first-ever Roman Urdu public dataset which consists of 3241 sentiments annotated against positive, negative, and neutral classes. To provide benchmark baseline performance over the presented dataset for Roman Urdu sentiment analysis, we adapt diverse machine learning (Support Vector Machine, Logistic Regression, Naive Bayes), deep learning (convolutional neural network, recurrent neural network), and hybrid deep learning approaches. Performance impact of generated neural word embeddings based representation is compared with other most widely used bag of words based feature representation approaches using diverse machine and deep learning classifiers. In order to improve the performance of Roman Urdu sentiment analysis, it proposes a novel precisely extreme multi-channel hybrid methodology which makes use of convolutional and recurrent neural networks along with pre-trained neural word embeddings. The proposed hybrid approach outperforms adapted machine learning approaches by the significant figure of 9% and deep learning approaches by the figure of 4% in terms of F1-score.

  • Research Article
  • 10.7717/peerj-cs.3342
Detecting hate speech in roman Urdu using a convolutional-BiLSTM-based deep hybrid neural network
  • Nov 3, 2025
  • PeerJ Computer Science
  • Muhammad Zohaib + 7 more

The detection of hate speech on social media has become a pressing challenge, particularly in multilingual and low-resource language settings such as Roman Urdu, where informal grammar, code-switching, and inconsistent orthography hinder accurate classification. Despite progress in hate speech detection for high-resource languages, limited research exists for Roman Urdu content. This study addresses this gap by proposing a computationally efficient deep learning framework based on a hybrid convolutional neural network and bidirectional long short-term memory (CNN-BiLSTM) architecture. The model leverages FastText pre-trained embeddings to capture subword-level semantics and combines convolutional layers for local feature extraction with BiLSTM for global context modeling. We evaluate our approach on a labeled Roman Urdu dataset and compare it with traditional machine learning models and deep learning baselines. Our proposed CNN-BiLSTM model achieves the highest performance with an accuracy of 80.67% and an F1-score of 81.47%, outperforming competitive baselines. These findings demonstrate the effectiveness and practicality of our lightweight architecture in detecting hate speech in Roman Urdu, offering a scalable solution for multilingual and resource-constrained environments.

  • Research Article
  • Citations: 1
  • 10.3390/a18060331
Machine Learning- and Deep Learning-Based Multi-Model System for Hate Speech Detection on Facebook
  • Jun 1, 2025
  • Algorithms
  • Amna Naseeb + 6 more

Hate speech is a complex topic that transcends language, culture, and even social spheres. Recently, the spread of hate speech on social media sites like Facebook has added a new layer of complexity to the issue of online safety and content moderation. This study seeks to minimize this problem by developing an Arabic script-based tool for automatically detecting hate speech in Roman Urdu, an informal script used most commonly for South Asian digital communications. Roman Urdu is relatively complex as there are no standardized spellings, leading to syntactic variations, which increases the difficulty of hate speech detection. To tackle this problem, we adopt a holistic strategy using a combination of six machine learning (ML) and four Deep Learning (DL) models, a dataset from Facebook comments, which was preprocessed (tokenization, stopwords removal, etc.), and text vectorization (TF-IDF, word embeddings). The ML algorithms used in this study are LR, SVM, RF, NB, KNN, and GBM. We also use deep learning architectures like CNN, RNN, LSTM, and GRU to increase the accuracy of the classification further. It is proven by the experimental results that deep learning models outperform the traditional ML approaches by a significant margin, with CNN and LSTM achieving accuracies of 95.1% and 96.2%, respectively. As far as we are aware, this is the first work that investigates QLoRA for fine-tuning large models for the task of offensive language detection in Roman Urdu.

  • Research Article
  • Citations: 25
  • 10.3390/s23083909
Roman Urdu Hate Speech Detection Using Transformer-Based Model for Cyber Security Applications
  • Apr 12, 2023
  • Sensors (Basel, Switzerland)
  • Muhammad Bilal + 4 more

Social media applications, such as Twitter and Facebook, allow users to communicate and share their thoughts, status updates, opinions, photographs, and videos around the globe. Unfortunately, some people utilize these platforms to disseminate hate speech and abusive language. The growth of hate speech may result in hate crimes, cyber violence, and substantial harm to cyberspace, physical security, and social safety. As a result, hate speech detection is a critical issue for both cyberspace and physical society, necessitating the development of a robust application capable of detecting and combating it in real-time. Hate speech detection is a context-dependent problem that requires context-aware mechanisms for resolution. In this study, we employed a transformer-based model for Roman Urdu hate speech classification due to its ability to capture the text context. In addition, we developed the first Roman Urdu pre-trained BERT model, which we named BERT-RU. For this purpose, we exploited the capabilities of BERT by training it from scratch on the largest Roman Urdu dataset consisting of 173,714 text messages. Traditional and deep learning models were used as baseline models, including LSTM, BiLSTM, BiLSTM + Attention Layer, and CNN. We also investigated the concept of transfer learning by using pre-trained BERT embeddings in conjunction with deep learning models. The performance of each model was evaluated in terms of accuracy, precision, recall, and F-measure. The generalization of each model was evaluated on a cross-domain dataset. The experimental results revealed that the transformer-based model, when directly applied to the classification task of the Roman Urdu hate speech, outperformed traditional machine learning, deep learning models, and pre-trained transformer-based models in terms of accuracy, precision, recall, and F-measure, with scores of 96.70%, 97.25%, 96.74%, and 97.89%, respectively. In addition, the transformer-based model exhibited superior generalization on a cross-domain dataset.

  • Research Article
  • Citations: 53
  • 10.1145/3414524
Hate Speech Detection in Roman Urdu
  • Jan 31, 2021
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Muhammad Moin Khan + 2 more

Hate speech is a specific type of controversial content that is widely legislated as a crime that must be identified and blocked. However, due to the sheer volume and velocity of the Twitter data stream, hate speech detection cannot be performed manually. To address this issue, several studies have been conducted for hate speech detection in European languages, whereas little attention has been paid to low-resource South Asian languages, making social media vulnerable for millions of users. In particular, to the best of our knowledge, no study has been conducted for hate speech detection in Roman Urdu text, which is widely used in the sub-continent. In this study, we have scraped more than 90,000 tweets and manually parsed them to identify 5,000 Roman Urdu tweets. Subsequently, we have employed an iterative approach to develop guidelines and used them for generating the Hate Speech Roman Urdu 2020 corpus. The tweets in this corpus are classified at three levels: Neutral-Hostile, Simple-Complex, and Offensive-Hate speech. As another contribution, we have used five supervised learning techniques, including a deep learning technique, to evaluate and compare their effectiveness for hate speech detection. The results show that Logistic Regression outperformed all other techniques, including deep learning techniques, for the two levels of classification, achieving an F1 score of 0.906 for distinguishing between Neutral-Hostile tweets and 0.756 for distinguishing between Offensive-Hate speech tweets.

  • Research Article
  • Citations: 59
  • 10.2196/22609
Detection of Hate Speech in COVID-19-Related Tweets in the Arab Region: Deep Learning and Topic Modeling Approach.
  • Dec 8, 2020
  • Journal of Medical Internet Research
  • Raghad Alshalan + 4 more

Background: The massive scale of social media platforms requires an automatic solution for detecting hate speech. These automatic solutions will help reduce the need for manual analysis of content. Most previous literature has cast the hate speech detection problem as a supervised text classification task using classical machine learning methods or, more recently, deep learning methods. However, work investigating this problem in Arabic cyberspace is still limited compared to the published work on English text.
Objective: This study aims to identify hate speech related to the COVID-19 pandemic posted by Twitter users in the Arab region and to discover the main issues discussed in tweets containing hate speech.
Methods: We used the ArCOV-19 dataset, an ongoing collection of Arabic tweets related to COVID-19, starting from January 27, 2020. Tweets were analyzed for hate speech using a pretrained convolutional neural network (CNN) model; each tweet was given a score between 0 and 1, with 1 being the most hateful text. We also used nonnegative matrix factorization to discover the main issues and topics discussed in hate tweets.
Results: The analysis of hate speech in Twitter data in the Arab region identified that the number of non–hate tweets greatly exceeded the number of hate tweets, where the percentage of hate tweets among COVID-19 related tweets was 3.2% (11,743/547,554). The analysis also revealed that the majority of hate tweets (8385/11,743, 71.4%) contained a low level of hate based on the score provided by the CNN. This study identified Saudi Arabia as the Arab country from which the most COVID-19 hate tweets originated during the pandemic. Furthermore, we showed that the largest number of hate tweets appeared during the time period of March 1-30, 2020, representing 51.9% of all hate tweets (6095/11,743). Contrary to what was anticipated, in the Arab region, it was found that the spread of COVID-19–related hate speech on Twitter was weakly related with the dissemination of the pandemic based on the Pearson correlation coefficient (r=0.1982, P=.50). The study also identified the commonly discussed topics in hate tweets during the pandemic. Analysis of the 7 extracted topics showed that 6 of the 7 identified topics were related to hate speech against China and Iran. Arab users also discussed topics related to political conflicts in the Arab region during the COVID-19 pandemic.
Conclusions: The COVID-19 pandemic poses serious public health challenges to nations worldwide. During the COVID-19 pandemic, frequent use of social media can contribute to the spread of hate speech. Hate speech on the web can have a negative impact on society, and hate speech may have a direct correlation with real hate crimes, which increases the threat associated with being targeted by hate speech and abusive language. This study is the first to analyze hate speech in the context of Arabic COVID-19–related tweets in the Arab region.

  • Research Article
  • Citations: 10
  • 10.3390/math11040969
Geo-Spatial Mapping of Hate Speech Prediction in Roman Urdu
  • Feb 14, 2023
  • Mathematics
  • Samia Aziz + 4 more

Social media has transformed into a crucial channel for political expression. Twitter, especially, is a vital platform used to exchange political hate in Pakistan. Political hate speech affects the public image of politicians, targets their supporters, and hurts public sentiments. Hate speech is a controversial public speech that promotes violence toward a person or group based on specific characteristics. Although studies have been conducted to identify hate speech in European languages, Roman languages have yet to receive much attention. In this research work, we present the automatic detection of political hate speech in Roman Urdu. An exclusive political hate speech labeled dataset (RU-PHS) containing 5002 instances and city-level information has been developed. To overcome the vast lexical structure of Roman Urdu, we propose an algorithm for the lexical unification of Roman Urdu. Three vectorization techniques are developed: TF-IDF, word2vec, and fastText. A comparative analysis of the accuracy and time complexity of conventional machine learning models and fine-tuned neural networks using dense word representations is presented for classifying and predicting political hate speech. The results show that a random forest and the proposed feed-forward neural network achieve an accuracy of 93% using fastText word embedding to distinguish between neutral and politically offensive speech. The statistical information helps identify trends and patterns, and the hotspot and cluster analysis assist in pinpointing Punjab as a highly susceptible area in Pakistan in terms of political hate tweet generation.

  • Research Article
  • Citations: 37
  • 10.1007/s00530-021-00784-8
Abusive language detection from social media comments using conventional machine learning and deep learning approaches
  • Apr 1, 2021
  • Multimedia Systems
  • Muhammad Pervez Akhter + 4 more

With the increase in the culture of social media and netizen, every day, millions of comments are posted on the uploaded posts. The use of abusive language in user comments has been increased rapidly. Abusive language in online comments initiates cyber-bullying that targets individuals (celebrity, politician, and product) and a group of people (specific country, age, and religion). It is important to detect and analyze abusive language from online comments automatically. There have been several attempts in the literature to detect abusive language for English. In this study, we perform abusive language detection from Urdu and Roman Urdu comments using five diverse ML models (NB, SVM, IBK, Logistic, and JRip) and four DL models (CNN, LSTM, BLSTM, and CLSTM). We apply these models on a large dataset with ten thousands of Roman Urdu comments and a small dataset with more than two thousand comments of Urdu. Natural language constructs, English-like nature of Roman Urdu script, and Nastaleeq style of Urdu make it more challenging to process and classify the comments of both scripts using deep learning and machine learning approaches. From experiments, we find that the convolutional neural network outperforms the other models and achieves 96.2% and 91.4% accuracy on Urdu and Roman Urdu. Our results also reveal that the one-layer architectures of deep learning models give better results than two-layer architectures. Further, we compare the performance of deep learning models with five conventional machine learning models and conclude that deep learning models perform significantly better than machine learning models.

  • Research Article
  • Citations: 23
  • 10.22581/muet1982.1902.20
Sentiment Analysis for Roman Urdu
  • Apr 1, 2019
  • Mehran University Research Journal of Engineering and Technology
  • Ayesha Rafique + 4 more

The majority of online comments/opinions are written in free-text format. Sentiment analysis can be used to measure the polarity (positive/negative) of comments/opinions. These comments/opinions can be in different languages, e.g., English, Urdu, Roman Urdu, Hindi, Arabic, etc. Most existing work addresses sentiment analysis of the English language. Very limited research work has been done in Urdu or Roman Urdu, even though Hindi/Urdu is the third largest language in the world. In this paper, we focus on the sentiment analysis of comments/opinions in Roman Urdu. There is no publicly available Roman Urdu public opinion dataset. We prepare a dataset by taking comments/opinions of people in Roman Urdu from different websites. Three supervised machine learning algorithms, namely NB (Naive Bayes), LRSGD (Logistic Regression with Stochastic Gradient Descent), and SVM (Support Vector Machine), have been applied to this dataset. From the results of the experiments, it can be concluded that SVM performs better than NB and LRSGD in terms of accuracy. In the case of SVM, an accuracy of 87.22% is achieved.

  • Research Article
  • Citations: 10
  • 10.1145/3474119
An Unsupervised Approach for Sentiment Analysis on Social Media Short Text Classification in Roman Urdu
  • Nov 3, 2021
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Toqir A Rana + 4 more

During the last two decades, sentiment analysis, also known as opinion mining, has become one of the most explored research areas in Natural Language Processing (NLP) and data mining. Sentiment analysis focuses on the sentiments or opinions of consumers expressed over social media or different web sites. Due to exposure on the Internet, sentiment analysis has attracted vast numbers of researchers over the globe. A large amount of research has been conducted in English, Chinese, and other languages used worldwide. However, Roman Urdu has been neglected despite being the third most used language for communication in the world, covering millions of users around the globe. Although some techniques have been proposed for sentiment analysis in Roman Urdu, these techniques are limited to a specific domain or developed incorrectly due to the unavailability of language resources available for Roman Urdu. Therefore, in this article, we are proposing an unsupervised approach for sentiment analysis in Roman Urdu. First, the proposed model normalizes the text to overcome spelling variations of different words. After normalizing text, we have used Roman Urdu and English opinion lexicons to correctly identify users’ opinions from the text. We have also incorporated negation terms and stemming to assign polarities to each extracted opinion. Furthermore, our model assigns a score to each sentence on the basis of the polarities of extracted opinions and classifies each sentence as positive, negative, or neutral. In order to verify our approach, we have conducted experiments on two publicly available datasets for Roman Urdu and compared our approach with the existing model. Results have demonstrated that our approach outperforms existing models for sentiment analysis tasks in Roman Urdu. Furthermore, our approach does not suffer from domain dependency.
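The pipeline this abstract describes (normalize spelling variants, look up opinion words in a lexicon, apply negation, score the sentence) can be sketched as follows. This is a minimal illustration, not the authors' model: the tiny `VARIANTS`, `POLARITY`, and `NEGATIONS` tables are hypothetical, stemming is omitted, and treating a negation word on either side of an opinion word as a polarity flip is an assumption made here to cover both "acha nahi" and "not good".

```python
# Hypothetical mini-resources; a real system would use far larger lexicons.
VARIANTS = {"achha": "acha", "accha": "acha", "nahin": "nahi", "nai": "nahi"}
POLARITY = {"acha": 1, "good": 1, "bura": -1, "bad": -1}
NEGATIONS = {"nahi", "not"}


def classify(sentence):
    """Label a sentence as 'positive', 'negative', or 'neutral'."""
    # Step 1: normalize spelling variants to one canonical form.
    tokens = [VARIANTS.get(t, t) for t in sentence.lower().split()]
    score = 0
    for i, token in enumerate(tokens):
        polarity = POLARITY.get(token, 0)
        if polarity == 0:
            continue  # not an opinion word
        # Step 2: flip polarity if a negation term is adjacent (assumption).
        prev_neg = i > 0 and tokens[i - 1] in NEGATIONS
        next_neg = i + 1 < len(tokens) and tokens[i + 1] in NEGATIONS
        if prev_neg or next_neg:
            polarity = -polarity
        # Step 3: accumulate the sentence score.
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


label_pos = classify("khana achha hai")
label_neg = classify("khana acha nahin hai")
```

Because the approach needs no labeled training data, its quality depends entirely on lexicon coverage and on how well the normalization table captures spelling variation.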

  • Research Article
  • Citations: 13
  • 10.32604/cmes.2022.019535
Sentiment Analysis of Roman Urdu on E-Commerce Reviews Using Machine Learning
  • Jan 1, 2022
  • Computer Modeling in Engineering & Sciences
  • Bilal Chandio + 7 more

Sentiment analysis task has widely been studied for various languages such as English and French. However, Roman Urdu sentiment analysis yet requires more attention from peer-researchers due to the lack of Off-the-Shelf Natural Language Processing (NLP) solutions. The primary objective of this study is to investigate the diverse machine learning methods for the sentiment analysis of Roman Urdu data which is very informal in nature and needs to be lexically normalized. To mitigate this challenge, we propose a fine-tuned Support Vector Machine (SVM) powered by Roman Urdu Stemmer. In our proposed scheme, the corpus data is initially cleaned to remove the anomalies from the text. After initial pre-processing, each user review is being stemmed. The input text is transformed into a feature vector using the bag-of-word model. Subsequently, the SVM is used to classify and detect user sentiment. Our proposed scheme is based on a dictionary based Roman Urdu stemmer. The creation of the Roman Urdu stemmer is aimed at standardizing the text so as to minimize the level of complexity. The efficacy of our proposed model is also empirically evaluated with diverse experimental configurations, so as to fine-tune the hyper-parameters and achieve superior performance. Moreover, a series of experiments are conducted on diverse machine learning and deep learning models to compare the performance with our proposed model. We also introduced the largest dataset on Roman Urdu, i.e., Roman Urdu e-commerce dataset (RUECD), which contains 26K+ user reviews annotated by the group of experts. The RUECD is challenging and the largest dataset available of Roman Urdu. The experiments show that the newly generated dataset is quite challenging and requires more attention from the peer researchers for Roman Urdu sentiment analysis.

  • Book Chapter
  • Citations: 4
  • 10.1007/978-3-030-93420-0_8
Detecting Hate Speech in Cross-Lingual and Multi-lingual Settings Using Language Agnostic Representations
  • Jan 1, 2021
  • Sebastián E Rodríguez + 2 more

The automatic detection of hate speech is a blooming field in the natural language processing community. In recent years there have been efforts in detecting hate speech in multiple languages, using models trained on multiple languages at the same time. Furthermore, there is special interest in the capabilities of language agnostic features to represent text in hate speech detection. This is because models can be trained in multiple languages, and then the capabilities of the model and representation can be tested on an unseen language. In this work we focused on detecting hate speech in mono-lingual, multi-lingual and cross-lingual settings. For this we used a pre-trained language model called Language Agnostic BERT Sentence Embeddings (LabSE), both for feature extraction and as an end to end classification model. We tested different models, such as Support Vector Machines and Tree-based models, and different representations, in particular bag of words, bag of characters, and sentence embeddings extracted from Multi-lingual BERT. The dataset used was the SemEval 2019 task 5 data set, which covers hate speech against immigrants and women in English and Spanish. The results show that the usage of LabSE as feature extraction improves the performance on both languages in a mono-lingual setting, and in a cross-lingual setting. Moreover, LabSE as an end to end classification model performs better than the results reported by the authors of the SemEval 2019 task 5 data set for the Spanish language.

  • Book Chapter
  • Citations: 2
  • 10.1007/978-3-030-93709-6_41
Multi-channel Convolutional Neural Network for Hate Speech Detection in Social Media
  • Jan 1, 2022
  • Zeleke Abebaw + 2 more

As online social media content continues to grow, so does the spread of hate speech. Hate speech has devastating consequences unless it is detected and monitored early. Recently, deep neural network-based hate speech detection models, particularly conventional single-channel Convolutional Neural Network (CNN), have achieved remarkable performance. However, the effectiveness of the models depends on the type of language they are trained on and the training data size. We argue that the effectiveness of the models could further be enhanced if we use multi-channel CNN models even for under-resourced languages that have limited training data size. This is because the single-channel CNN might fail to consider the potential effect of multiple channels to generate better features, which is not well investigated for hate speech detection. Therefore, in this work, we explore the use of multi-channel CNN to extract better features from different channels in an end-to-end manner on top of a word2vec embedding layer. Tested on a new small-scale Amharic hate speech dataset containing 2000 annotated social media comments, the experimental results show that the proposed multi-channel CNN model outperforms the single-channel CNN models but underperforms the baseline Support Vector Machine (SVM), with average F-scores of 81.3%, 78.2%, and 92.5%, respectively. The finding of the study implies that the proposed MC-CNN model can be used as an alternative solution for hate speech detection using a deep learning approach when dataset scarcity is an issue.
Keywords: Social media, Deep learning, Word embedding, Amharic hate speech detection, Single-channel, Multi-channel, Convolutional neural network
