Abstract
With the popularization of artificial intelligence technology, adversarial attacks have become a major challenge in machine learning. This paper explores the robustness of multimodal and unimodal models under textual adversarial attacks and examines their differences and commonalities. By comparing the performance of the multimodal CLIP model and the unimodal BERT model on different text datasets, we show that the multimodal model does not outperform the unimodal model under unimodal adversarial attack when the advantage of multimodal fusion cannot come into play. On the contrary, under single-modal adversarial attacks CLIP exhibits large robustness fluctuations similar to those of BERT. The advantages of multimodal models do not automatically translate into better robustness in all scenarios; they must be optimized for specific tasks and adversarial strategies, and without task-specific pre-training multimodal models do not achieve better accuracy than unimodal models. Both model types exhibit significant robustness fluctuations in the face of textual adversarial attacks. This work provides reference value and directions for future research.
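To illustrate the kind of comparison the abstract describes, the sketch below contrasts a fine-tuned BERT classifier with CLIP's text encoder used zero-shot, measuring accuracy before and after a toy character-swap perturbation. This is a minimal sketch, not the paper's actual attack or protocol: the checkpoint names, prompt template, perturbation, and sample data are all illustrative assumptions.

```python
# Minimal sketch (not the paper's protocol): compare BERT and CLIP text-side
# robustness under a toy character-swap perturbation. Checkpoints, prompts,
# and data below are illustrative assumptions.
import random
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          CLIPModel, CLIPTokenizer)

LABELS = ["negative", "positive"]  # assumed binary sentiment task

def char_swap(text: str, rate: float = 0.1) -> str:
    """Toy adversarial perturbation: randomly swap adjacent characters."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# --- Unimodal baseline: a BERT classifier (assumed fine-tuned checkpoint) ---
bert_name = "textattack/bert-base-uncased-SST-2"
bert_tok = AutoTokenizer.from_pretrained(bert_name)
bert = AutoModelForSequenceClassification.from_pretrained(bert_name).eval()

def bert_predict(text: str) -> int:
    inputs = bert_tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return int(bert(**inputs).logits.argmax(dim=-1))

# --- Multimodal model: CLIP's text encoder used zero-shot by matching the
# --- input text against one prompt per label (assumed prompt template) ---
clip_name = "openai/clip-vit-base-patch32"
clip_tok = CLIPTokenizer.from_pretrained(clip_name)
clip = CLIPModel.from_pretrained(clip_name).eval()

def clip_predict(text: str) -> int:
    prompts = [f"a {label} review" for label in LABELS]
    inputs = clip_tok([text] + prompts, return_tensors="pt",
                      padding=True, truncation=True)
    with torch.no_grad():
        feats = clip.get_text_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    sims = feats[0] @ feats[1:].T  # cosine similarity: text vs. each prompt
    return int(sims.argmax())

def robustness(predict, data):
    """Accuracy on clean inputs vs. the same inputs after perturbation."""
    clean = sum(predict(t) == y for t, y in data) / len(data)
    adv = sum(predict(char_swap(t)) == y for t, y in data) / len(data)
    return clean, adv

data = [("a wonderful, heartfelt film", 1), ("dull and lifeless plot", 0)]
for name, fn in [("BERT", bert_predict), ("CLIP", clip_predict)]:
    c, a = robustness(fn, data)
    print(f"{name}: clean accuracy={c:.2f}, adversarial accuracy={a:.2f}")
```

The gap between clean and adversarial accuracy for each model is the robustness fluctuation the abstract refers to; a stronger word- or sentence-level attack (e.g., synonym substitution) would replace the character-swap stand-in in a fuller study.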