A curated dataset for hate speech detection on social media text

Yidong Huang,Thiago Eustaquio Alves De Oliveira,Devansh Mody

doi:10.1016/j.dib.2022.108832

Abstract

Social media platforms have become the most prominent medium for spreading hate speech, primarily through hateful textual content. An extensive dataset containing emoticons, emojis, hashtags, slang, and contractions is required to detect hate speech on social media based on current trends. Therefore, our dataset is curated from various sources like Kaggle, GitHub, and other websites. This dataset contains hate speech sentences in English and is confined into two classes, one representing hateful content and the other representing non-hateful content. It has 451,709 sentences in total. 371,452 of these are hate speech, and 80,250 are non-hate speech. An augmented balanced dataset with 726,120 samples is also generated to create a custom vocabulary of 145,046 words. The total number of contractions considered in the dataset is 6403. The total number of bad words usually used in hateful content is 377. The text in each sentence of the final dataset, which is utilized for training and cross-validation, is limited to 180 words. The generated contractions dataset can be used for any projects in the area of NLP for data preprocessing. The augmented dataset can help to reduce the number of out-of-vocabulary words, and the hate speech dataset can be used as a classifier to detect hate or no hate on social media platforms.

Full Text