Abstract

Multi-modal fusion is an active topic in the field of multi-modal learning. Most previous multi-modal fusion work assumes that all modalities are complete. Existing research on fusion with missing modalities fails to account for modalities that are missing at random, and therefore lacks robustness; moreover, most methods rely only on the correlation between missing and non-missing modalities, ignoring the contextual information of the missing modalities. Considering these two issues, we design a multiple multi-head attentions network based on an encoder with missing modalities (MMAN-M2). First, a multi-head attention network represents each single modality by extracting latent features over the entire sequence, and the modality representations are then fused. Next, contextual features of the missing modalities are extracted by optimizing the multi-modal fusion result over both missing and non-missing feature data, and the missing modalities are encoded through an encoding module. Finally, a Transformer encoder-decoder module trains the network by mapping the obtained global information to multiple spaces and integrating our uncertain multi-modal encoding, and it performs multi-modal fusion classification to evaluate model performance. Extensive experiments on public multi-modal datasets show that the proposed method achieves the best results and effectively improves the classification performance of multi-modal fusion.
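The abstract outlines a pipeline of per-modality multi-head attention, fusion of the available modalities, a learned encoding substituted for missing modalities, and a Transformer-based fusion stage for classification. The sketch below illustrates that general idea; it is not the authors' implementation, and the module layout, dimensions, missing-modality token, and mean-pooling fusion are all illustrative assumptions.

```python
# Minimal PyTorch sketch of the MMAN-M2 idea described above (assumptions, not the paper's code):
# per-modality multi-head attention, a learned encoding for missing modalities,
# and a Transformer encoder that fuses the modality representations for classification.
import torch
import torch.nn as nn


class MMANM2Sketch(nn.Module):
    def __init__(self, num_modalities=3, dim=128, num_heads=4, num_classes=2):
        super().__init__()
        # One multi-head self-attention block per modality to extract sequence-level features.
        self.modal_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(num_modalities)]
        )
        # Learned embedding used in place of a modality that is missing (assumed design choice).
        self.missing_token = nn.Parameter(torch.zeros(num_modalities, dim))
        # Transformer encoder fuses the per-modality representations into global information.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.fusion_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, inputs, present_mask):
        # inputs: list of per-modality tensors of shape (batch, seq_len, dim), or None if absent.
        # present_mask: (batch, num_modalities) bool tensor, True where the modality is observed.
        fused = []
        for m, x in enumerate(inputs):
            if x is None:
                # Modality missing for the whole batch: use the learned missing encoding.
                feat = self.missing_token[m].expand(present_mask.size(0), -1)
            else:
                attn_out, _ = self.modal_attn[m](x, x, x)   # self-attention over the sequence
                feat = attn_out.mean(dim=1)                 # pool to one vector per sample
                # Replace samples where this modality is missing with the learned encoding.
                feat = torch.where(present_mask[:, m:m + 1], feat, self.missing_token[m])
            fused.append(feat)
        tokens = torch.stack(fused, dim=1)                  # (batch, num_modalities, dim)
        encoded = self.fusion_encoder(tokens)
        return self.classifier(encoded.mean(dim=1))         # class logits
```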
