In today’s society, information spreads among individuals very rapidly due to the widespread use of social media platforms such as Twitter (now rebranded as X). However, this information may pose challenges to maintaining a healthy online environment, because it often contains harmful content. This paper presents a novel approach to identifying different categories of offensive posts, such as hate speech, profanity, targeted insult, and derogatory commentary, by analyzing multi-modal image and text data collected from Twitter. We propose a comprehensive deep learning framework, “Value Mixed Cross Attention Transformer” (VMCA-Trans), that leverages a combination of computer vision and natural language processing methodologies to effectively classify posts into four classes with binary labels. To build the proposed model, we have created an in-house dataset (OffenTweet) comprising Twitter posts containing textual content accompanied by images. The dataset is carefully annotated by several experts with offensive labels such as hate speech, profanity, targeted insult, and derogatory commentary. VMCA-Trans utilizes fine-tuned state-of-the-art transformer-based backbones such as ViT, BERT, and RoBERTa. The combined representation of image and text embeddings obtained from these fine-tuned transformer encoders is fed into a classifier that categorizes posts into offensive and non-offensive classes. To assess its effectiveness, we extensively evaluate the VMCA-Trans model using various performance metrics. The results indicate that the proposed multi-modal approach achieves superior performance compared to traditional unimodal methods.
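The abstract does not detail the “value mixing” mechanism of VMCA-Trans, but the general fusion step it describes — text and image embeddings combined via cross-attention and passed to a classifier — can be sketched with standard scaled dot-product attention, where text token embeddings act as queries over image patch embeddings (such as ViT patch outputs). This is a minimal illustrative sketch, not the authors’ implementation; all dimensions and names below are hypothetical assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_emb, image_emb, d_k):
    # text tokens as queries; image patches as keys and values
    scores = text_emb @ image_emb.T / np.sqrt(d_k)   # (n_text, n_patches)
    weights = softmax(scores, axis=-1)               # rows sum to 1
    return weights @ image_emb                       # image info aligned to text tokens

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(12, 64))    # e.g. 12 BERT token embeddings (dim 64, hypothetical)
image_emb = rng.normal(size=(49, 64))   # e.g. 49 ViT patch embeddings (7x7 grid, hypothetical)

fused = cross_attention(text_emb, image_emb, d_k=64)
# joint representation: mean-pooled text concatenated with mean-pooled attended image features
pooled = np.concatenate([text_emb.mean(axis=0), fused.mean(axis=0)])
print(pooled.shape)  # (128,) -- would be fed to a binary offensive/non-offensive classifier head
```

In a full model, the pooled vector would pass through a learned classification head (e.g. a small MLP with sigmoid outputs, one per offense category); here the pooling and concatenation merely illustrate how the two modalities are joined into a single representation.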