Abstract

Micro-videos have gained popularity on social media platforms because they provide a rich medium for real-time storytelling. Although micro-videos are naturally characterized by several modalities, it remains difficult to develop a flexible multimodal representation learning framework that integrates complementary and consistent information when the set of missing modalities is uncertain. To better address incomplete modalities in multimodal micro-video classification, in this paper we propose a self-supervised deep multimodal adversarial network (SDMAN) that learns comprehensive and robust micro-video representations. Specifically, we first introduce a parallel multi-head attention (MHA) encoding module that simultaneously learns representations of the complete modality grouping and of the incomplete modality groupings. We then present a multimodal self-supervised cycle generative adversarial network module, in which multiple generative adversarial networks transfer information from the complete modality grouping to the incomplete groupings, so that complementarity and consistency are mutually promoted among the modalities. Finally, experiments on a large-scale micro-video dataset demonstrate that SDMAN outperforms state-of-the-art methods.
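To make the attention-based encoding over modality groupings concrete, the following is a minimal, self-contained sketch in plain Python. It is not the paper's implementation: the embeddings are toy vectors, a single attention head with no learned projections stands in for the full MHA module, and the "incomplete grouping" is simply a subset of modalities (here, the acoustic modality is assumed missing). All names and values are illustrative assumptions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors.

    A single-head simplification of multi-head attention: each query
    attends over all keys, and the output is the attention-weighted
    combination of the value vectors.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Hypothetical 4-dimensional embeddings for three modalities.
visual   = [1.0, 0.0, 0.5, 0.2]
acoustic = [0.3, 1.0, 0.1, 0.4]
textual  = [0.2, 0.5, 1.0, 0.1]

# Complete grouping uses all modalities; an incomplete grouping
# (acoustic assumed missing) is encoded in parallel by the same module.
complete   = [visual, acoustic, textual]
incomplete = [visual, textual]

rep_complete   = attention(complete, complete, complete)
rep_incomplete = attention(incomplete, incomplete, incomplete)
```

In the paper's framework, the representation of the complete grouping would then serve as the supervision signal that the cycle-GAN module transfers to the incomplete groupings; the sketch above only illustrates the parallel encoding step.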
