The growth of social media usage over the last decade has made a massive and valuable volume of multimedia data available. However, the lack of large annotated multimodal datasets, along with the inherent noise and the diversity of multimodal relations in this type of data, presents challenges for machine learning methods. Unlike classic multimodal data, social media data exhibits a wide variety of relations between image and text, making the interaction between the two modalities more difficult to model. Previous research has concentrated on fusion strategies with separate encoders for each modality. This paper introduces CMB (Caption-based Multimodal BERT), a method for classifying crisis-related social media posts using information from both images and text. CMB translates the image modality into a text-compatible space, facilitating intermodal interaction. Furthermore, CMB enables training strategies that improve the model's robustness to missing modalities. Experimental results show that CMB is competitive with well-established, costly, and manually crafted multimodal models.
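To make the caption-based idea concrete, the sketch below illustrates one way to translate the image modality into text (by generating a caption) and then classify the post text and caption jointly with a single BERT encoder. This is only an illustrative sketch of the general approach described in the abstract, not the authors' implementation: the captioning model (BLIP), the BERT checkpoint, the sentence-pair input format, and the text-only fallback for missing images are all assumptions.

```python
# Illustrative sketch (not the paper's code): caption the image, then classify
# the post text and the caption together with one BERT encoder.
from PIL import Image
import torch
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    BertTokenizer, BertForSequenceClassification,
)

# 1. Translate the image modality into a text-compatible space by captioning it.
#    BLIP is an assumed, off-the-shelf captioner; the paper's choice may differ.
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
cap_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(path: str) -> str:
    image = Image.open(path).convert("RGB")
    inputs = cap_processor(images=image, return_tensors="pt")
    out = cap_model.generate(**inputs, max_new_tokens=30)
    return cap_processor.decode(out[0], skip_special_tokens=True)

# 2. Feed post text and caption to BERT as a sentence pair so the two "texts"
#    can interact through self-attention, then classify.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
classifier = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def classify_post(post_text: str, image_path: str | None) -> int:
    # If the image is missing, fall back to the post text alone
    # (one simple way to stay robust to a missing modality).
    caption = caption_image(image_path) if image_path else ""
    inputs = tokenizer(post_text, caption, return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        logits = classifier(**inputs).logits
    return int(logits.argmax(dim=-1))
```

In this framing, no separate image encoder or fusion module is needed at classification time; the caption stands in for the image, so a standard text classifier handles both modalities.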