Abstract

Data imperfection is an inevitable phenomenon in real-world applications and has emerged as one of the most critical challenges for multimodal sentiment analysis. However, existing approaches tend to focus narrowly on a specific type of imperfection, leading to performance degradation in real-world scenarios where multiple types of noise exist simultaneously. In this work, we formulate the imperfection as modality feature missing during training and propose a noise imitation based adversarial training framework to improve robustness against various potential imperfections at inference time. Specifically, the proposed method first uses temporal feature erasing as an augmentation to construct noisy instances and exploits modality interactions through the self-attention mechanism to learn multimodal representations for original-noisy instance pairs. Then, based on the paired intermediate representations, a novel adversarial training strategy with semantic reconstruction supervision is proposed to learn a unified joint representation for noisy and perfect data. In experiments, the proposed method is first verified under modality feature missing, the same type of imperfection as in training, and shows impressive performance. Moreover, we show that our approach achieves outstanding results for other types of imperfection, including modality missing, automatic speech recognition errors, and attacks on text, highlighting the generalizability of our model. Finally, we conduct case studies on general additive perturbations, which introduce background noise and blur into raw video clips, further revealing the capability of our proposed method for real-world applications.
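As an illustration of the temporal feature erasing augmentation mentioned above, the following is a minimal sketch, not the authors' implementation: it assumes one modality is represented as a (time steps, feature dimension) tensor, and the function name temporal_feature_erase, the erase_ratio parameter, and the tensor sizes are all hypothetical choices made for this example.

```python
import torch

def temporal_feature_erase(features: torch.Tensor, erase_ratio: float = 0.3) -> torch.Tensor:
    """Zero out a random fraction of time steps in a modality feature sequence.

    Hypothetical illustration of noisy-instance construction; the paper's exact
    erasing scheme (ratio, sampling strategy, per-modality handling) may differ.

    Args:
        features: tensor of shape (seq_len, feat_dim) for one modality.
        erase_ratio: fraction of time steps whose features are erased.
    """
    seq_len = features.size(0)
    num_erase = int(seq_len * erase_ratio)
    noisy = features.clone()
    if num_erase == 0:
        return noisy
    # Randomly pick time steps and erase their features to imitate missing data.
    erase_idx = torch.randperm(seq_len)[:num_erase]
    noisy[erase_idx] = 0.0
    return noisy

# Build an original-noisy instance pair for one modality (illustrative sizes).
original = torch.randn(50, 74)   # e.g., 50 time steps of 74-dim acoustic features
noisy = temporal_feature_erase(original, erase_ratio=0.3)
```

The original-noisy pair produced this way would then be fed through the shared self-attention encoder so that the adversarial and reconstruction objectives can align the two joint representations.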
