Abstract

The investigation of missing modalities aims to extract valuable feature information from incomplete multi-modal data and is a focal point in multi-modal learning. Existing methods for handling missing modalities focus primarily on multi-modal fusion schemes to achieve optimal performance, but they face two key challenges: (1) how to improve the robustness of incomplete multi-modal sequence representations, and (2) how to effectively learn modality-invariant representations that mitigate the heterogeneity between modalities. In this paper, we propose MIT-FRNet, a modality-invariant temporal representation learning-based feature reconstruction network for missing modalities, to tackle these challenges. MIT-FRNet first extracts latent features for each modality, considering both intra-modality and inter-modality information. To address the first challenge, it introduces an encoder-decoder framework that takes the incomplete modal sequences as input and reconstructs the features of missing elements via intra-modal and cross-modal attention mechanisms. To address the second challenge, we treat each timestamp as a single Gaussian distribution and design a fine-grained similarity constraint on these distribution-level representations to learn effective modality-invariant representations. Finally, the learned representations are passed through a gate encoder and combined by vector fusion, and the classification results after this multi-modal fusion validate the effectiveness of the proposed model. Extensive experiments on public benchmark datasets demonstrate that MIT-FRNet achieves promising results under varying missing rates while exhibiting good convergence.
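To make the distribution-level similarity constraint concrete, the following is a minimal PyTorch sketch of one plausible instantiation: each timestamp of each modality is mapped to a diagonal Gaussian, and a symmetric KL divergence aligns the two modalities' per-timestamp distributions. The module and function names (`GaussianHead`, `gaussian_kl`, `invariance_loss`), the feature dimensions, and the choice of symmetric KL are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (not the authors' code) of a fine-grained,
# distribution-level modality-invariance constraint: every timestamp of every
# modality is treated as a diagonal Gaussian, and a symmetric KL divergence
# pulls the two modalities' per-timestamp distributions together.
import torch
import torch.nn as nn


class GaussianHead(nn.Module):
    """Maps per-timestamp features to the mean / log-variance of a Gaussian."""

    def __init__(self, dim: int):
        super().__init__()
        self.mu = nn.Linear(dim, dim)
        self.logvar = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor):
        # h: (batch, time, dim) -> per-timestamp Gaussian parameters
        return self.mu(h), self.logvar(h)


def gaussian_kl(mu1, logvar1, mu2, logvar2):
    """KL( N(mu1, var1) || N(mu2, var2) ) for diagonal Gaussians, elementwise."""
    var1, var2 = logvar1.exp(), logvar2.exp()
    return 0.5 * (logvar2 - logvar1 + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)


def invariance_loss(h_a, h_b, head_a, head_b):
    """Symmetric per-timestamp KL constraint between two modalities."""
    mu_a, lv_a = head_a(h_a)
    mu_b, lv_b = head_b(h_b)
    kl = gaussian_kl(mu_a, lv_a, mu_b, lv_b) + gaussian_kl(mu_b, lv_b, mu_a, lv_a)
    return kl.mean()


# Toy usage: two modalities with time-aligned sequences of length 20.
text_feat, audio_feat = torch.randn(8, 20, 64), torch.randn(8, 20, 64)
head_t, head_a = GaussianHead(64), GaussianHead(64)
loss = invariance_loss(text_feat, audio_feat, head_t, head_a)
```

Because the constraint is applied per timestamp rather than on pooled sequence embeddings, it is fine-grained in the sense described above: temporally local statistics of the two modalities are matched, not just their global averages.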
