INTRODUCTION: With the development of the Internet, users increasingly express their opinions and emotions through text, visual and/or audio content. This has increased interest in multimodal analysis methods. OBJECTIVES: This study addresses multimodal sentiment analysis of tweets related to natural disasters by combining textual and visual embeddings. METHODS: Combining textual representations with the emotional expressions in the visual content provides a more comprehensive analysis. To investigate the impact of high-level visual and textual features, a three-layer neural network is used in the study, where the first two layers gather features from the different modalities and the third layer is used to analyze sentiment. RESULTS: In experimental tests on our dataset, the highest performance values (77% accuracy, 71% F1-score) are achieved by using the CLIP model for images and the RoBERTa model for text. CONCLUSION: Such analyses can be used in different application areas such as agencies, advertising, social/digital media content production and humanitarian aid organizations, and can provide important information in terms of social awareness.
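The three-layer arrangement described in METHODS can be illustrated with a minimal sketch. It is not the authors' implementation: the embedding sizes (768 for text, 512 for images), the hidden size, the three sentiment classes, the ReLU and softmax choices, and all function names are assumptions made for illustration. Random vectors stand in for RoBERTa and CLIP embeddings, and the weights are untrained.

```python
import math
import random

# Assumed sizes (hypothetical): 768-d text embedding, 512-d image embedding,
# 64 hidden units per modality, 3 sentiment classes.
TEXT_DIM, IMAGE_DIM, HIDDEN_DIM, NUM_CLASSES = 768, 512, 64, 3


def init_layer(rng, in_dim, out_dim):
    """Return (weights, biases) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def dense(x, layer):
    """Affine transform: one output value per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def predict(text_emb, image_emb, params):
    """Project each modality, concatenate the features, then classify."""
    text_feat = relu(dense(text_emb, params["text"]))     # layer 1: text branch
    image_feat = relu(dense(image_emb, params["image"]))  # layer 2: image branch
    fused = text_feat + image_feat                        # list concatenation
    return softmax(dense(fused, params["classifier"]))    # layer 3: sentiment


rng = random.Random(0)
params = {
    "text": init_layer(rng, TEXT_DIM, HIDDEN_DIM),
    "image": init_layer(rng, IMAGE_DIM, HIDDEN_DIM),
    "classifier": init_layer(rng, 2 * HIDDEN_DIM, NUM_CLASSES),
}

# Random stand-ins for real embeddings; no pretrained model is loaded here.
text_emb = [rng.gauss(0.0, 1.0) for _ in range(TEXT_DIM)]
image_emb = [rng.gauss(0.0, 1.0) for _ in range(IMAGE_DIM)]
probs = predict(text_emb, image_emb, params)
```

Here `probs` holds one probability per assumed sentiment class. In practice the two input vectors would come from the pretrained text and image encoders, and the three layers would be trained on labelled tweets.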