Abstract

Effective unimodal representation and complementary crossmodal fusion are both important in multimodal representation learning. Prior works often modulate one modality's features with another's directly, underutilizing both unimodal and crossmodal representation refinement and thereby limiting performance. In this paper, the Unimodal and Crossmodal Refinement Network (UCRN) is proposed to enhance both unimodal and crossmodal representations. Specifically, to improve unimodal representations, a unimodal refinement module refines modality-specific learning by iteratively updating the representation distribution with transformer-based attention layers. Self-quality improvement layers then progressively generate the desired weighted representations. The unimodal representations are subsequently projected into a common latent space, regularized by a multimodal Jensen-Shannon divergence loss for better crossmodal refinement. Lastly, a crossmodal refinement module integrates all information. By hierarchically exploring unimodal, bimodal, and trimodal interactions, UCRN is highly robust against missing modalities and noisy data. Experimental results on the MOSI and MOSEI datasets show that UCRN outperforms recent state-of-the-art techniques, and its robustness makes it well suited to real multimodal sequence fusion scenarios. Code will be shared publicly.
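
The abstract mentions a multimodal Jensen-Shannon divergence loss over the unimodal representations projected into the common latent space; since its exact form is not given here, the following is only a minimal sketch of a generalized JS regularizer under assumed choices (softmax-normalized distributions, per-modality tensors of shape (batch, d), and the hypothetical name multimodal_js_divergence).

    import torch
    import torch.nn.functional as F

    def multimodal_js_divergence(projections):
        """Generalized Jensen-Shannon divergence across modality projections.

        `projections`: list of (batch, d) tensors, e.g. language, audio, and
        vision features after projection into the common latent space.
        Each tensor is normalized into a distribution with a softmax; the loss
        is the average KL divergence of each modality's distribution from
        their mixture, which is zero only when all modalities agree.
        """
        probs = [F.softmax(z, dim=-1) for z in projections]
        mixture = torch.stack(probs, dim=0).mean(dim=0)   # mixture distribution M
        log_mixture = mixture.clamp_min(1e-12).log()
        # JS(P_1, ..., P_m) = (1/m) * sum_i KL(P_i || M)
        return sum(F.kl_div(log_mixture, p, reduction="batchmean")
                   for p in probs) / len(probs)

In training, such a term would typically be added to the task loss with a small weight so that the three modality distributions are pulled toward one another in the shared space.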

Highlights

  • Motivated by recent research achievements on modality representations in language (Pennington et al., 2014; Devlin et al., 2019; Brown et al., 2020), audio (Degottex et al., 2014; Chen et al., 2018; Li et al., 2019), and vision …

  • In the Self-Quality Improvement Layers (SQIL), the refined unimodal representation is first passed through global average pooling on the feature dimension, followed by a fully connected layer that learns the correlation within features, resulting in a … (a minimal sketch of this mechanism follows this list)

  • For a fair comparison with TBJE, the experimental results are reported on this dataset

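The SQIL description in the second highlight reads like a squeeze-and-excitation-style gate; as referenced there, here is a minimal sketch under that assumption. The class name SQILGate, the reduction ratio, the sigmoid gating, and the choice of pooling axis are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn as nn

    class SQILGate(nn.Module):
        """Illustrative self-quality improvement gate (assumed design):
        pool the refined unimodal sequence, pass it through a small
        fully connected network to capture feature correlations, and
        re-weight the representation with the learned gate."""

        def __init__(self, dim, reduction=4):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(dim, dim // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(dim // reduction, dim),
                nn.Sigmoid(),
            )

        def forward(self, x):
            # x: (batch, seq_len, dim) refined unimodal representation
            pooled = x.mean(dim=1)        # global average pooling
            gate = self.fc(pooled)        # per-feature weights in (0, 1)
            return x * gate.unsqueeze(1)  # weighted representation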

Summary

Introduction

Motivated by recent research achievements on modality representations in language (Pennington et al., 2014; Devlin et al., 2019; Brown et al., 2020), audio (Degottex et al., 2014; Chen et al., 2018; Li et al., 2019), and vision, … From this view, the strategy of direct modality modulation or pairwise translation may jeopardize learning richly fused representations and lead to suboptimal results. Most of these network architectures require all modalities as input, so the learned representations may perform poorly in the real world, where complete modalities are not always simultaneously available (e.g., a specific modality is missing or noisy). This may be because of over-fusing or overlooking the importance of unimodal refinement.
