Abstract

Memes have become a fundamental part of online communication and humour, reflecting and shaping the culture of today’s digital age. The growth of Meme culture, however, is inadvertently endorsing and propagating casual Misogyny. This study proposes V-LTCS (Vision-Language Transformer Combination Search), a framework that covers all possible combinations of the most fitting Text (i.e., BERT, ALBERT, and XLM-R) and Vision (i.e., Swin, ConvNeXt, and ViT) Transformer Models to determine the backbone architecture for identifying Memes that contain misogynistic content. All feasible Vision-Language Transformer Model combinations formed from these Text and Vision Transformer Models are evaluated on two datasets (one smaller, one larger) using standard metrics (viz., Accuracy, Precision, Recall, and F1-Score). The BERT-ViT combination demonstrated its efficiency on both datasets, supporting its use as a backbone architecture for subsequent efforts to recognize Multimodal Misogynous Memes.
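The combination search described above can be illustrated with a minimal Python sketch. It only shows the enumeration of the nine Text-Vision pairings and the computation of the four reported metrics; it is not the authors' implementation. The `evaluate` callback, the function names, and the choice of F1-Score as the ranking criterion are assumptions made for illustration, since the abstract does not specify how the pairings are trained, fused, or ranked.

```python
from itertools import product

# Candidate backbones named in the abstract.
TEXT_MODELS = ["BERT", "ALBERT", "XLM-R"]
VISION_MODELS = ["Swin", "ConvNeXt", "ViT"]


def binary_metrics(y_true, y_pred):
    """Accuracy, Precision, Recall and F1-Score for binary labels (1 = misogynous)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def search_combinations(evaluate):
    """Score every Text-Vision pairing and return the best pair plus all results.

    `evaluate(text_name, vision_name)` is a hypothetical callback that trains
    and tests one pairing and returns (y_true, y_pred) for a held-out set.
    Ranking by F1-Score is an assumption, not taken from the paper.
    """
    results = {}
    for text_name, vision_name in product(TEXT_MODELS, VISION_MODELS):
        y_true, y_pred = evaluate(text_name, vision_name)
        results[(text_name, vision_name)] = binary_metrics(y_true, y_pred)
    best = max(results, key=lambda pair: results[pair]["f1"])
    return best, results
```

In practice, `evaluate` would wrap the actual fine-tuning and inference code for one pairing; running the search once per dataset would then give one table of nine rows for the smaller dataset and one for the larger.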
