Abstract

Multi-modal fusion is an active topic in multi-modal learning. Most previous multi-modal fusion work assumes that all modalities are complete. Existing research on fusion with missing modalities fails to account for modalities that are missing at random and therefore lacks robustness; moreover, most methods rely only on the correlation between missing and non-missing modalities and ignore the contextual information of the missing modalities. To address these two issues, we design a multiple multi-head attention network based on an encoder for missing modalities (MMAN-M2). First, a multi-head attention network represents each single modality by extracting latent features over the entire sequence, and the modality representations are then fused. Second, contextual features of the missing modalities are extracted by optimizing the multi-modal fusion result over both missing and non-missing feature data, and the missing modalities are encoded by the encoding module. Finally, a Transformer encoder-decoder module trains the network by mapping the obtained global information into multiple spaces and integrating the uncertain multi-modal encodings, and performs multi-modal fusion classification to evaluate model performance. Extensive experiments on public multi-modal datasets show that the proposed method achieves the best results and effectively improves the classification performance of multi-modal fusion.
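
The abstract does not include an implementation, but the overall pipeline can be illustrated with a minimal sketch: each modality is represented by multi-head self-attention over its whole sequence, missing modalities are replaced by a learned encoding, and a Transformer encoder fuses the result for classification. All class names (`ModalityEncoder`, `FusionClassifier`), layer sizes, and the learned-placeholder masking strategy below are illustrative assumptions, not the authors' exact MMAN-M2 architecture.

```python
# Minimal sketch (assumed, not from the paper): per-modality multi-head
# self-attention followed by fusion over a missing-modality mask, in the
# spirit of MMAN-M2. All dimensions and names are illustrative.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Represents one modality with multi-head self-attention over its sequence."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); self-attention over the entire sequence
        out, _ = self.attn(x, x, x)
        return self.norm(out + x).mean(dim=1)  # pooled modality representation


class FusionClassifier(nn.Module):
    """Fuses modality representations and classifies with a Transformer encoder."""

    def __init__(self, n_modalities: int = 3, dim: int = 128, n_classes: int = 2):
        super().__init__()
        self.encoders = nn.ModuleList(ModalityEncoder(dim) for _ in range(n_modalities))
        # Learned placeholder embeddings stand in for missing modalities (assumption).
        self.missing_token = nn.Parameter(torch.zeros(n_modalities, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, inputs: list, present: torch.Tensor) -> torch.Tensor:
        # inputs[i]: (batch, seq_len, dim); present: (batch, n_modalities) boolean mask
        reps = torch.stack([enc(x) for enc, x in zip(self.encoders, inputs)], dim=1)
        reps = torch.where(present.unsqueeze(-1), reps, self.missing_token.unsqueeze(0))
        fused = self.fusion(reps).mean(dim=1)  # global fused representation
        return self.head(fused)


if __name__ == "__main__":
    model = FusionClassifier()
    xs = [torch.randn(8, 20, 128) for _ in range(3)]
    present = torch.tensor([[True, True, False]] * 8)  # third modality missing
    print(model(xs, present).shape)  # -> torch.Size([8, 2])
```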
