Abstract

A unified venue representation is typically obtained by integrating the multi-modal information of micro-videos, which makes reliable multi-modal fusion essential. Prior works lack uncertainty estimation, which indicates whether, or to what extent, the decisions from individual modalities can be trusted, so the reliability of their fused decisions is limited. To this end, this paper proposes an attention-enhanced and trusted multimodal learning (AETML) model to achieve reliable multimodal decision fusion for micro-video venue recognition. Specifically, a domain-adaptive visual transformer (VT) serves as a robust visual feature extractor, while a denoising autoencoder and the sentence2vector method are applied to extract acoustic and textual features, respectively. Attention networks then enhance the extracted features and output class probabilities for each modality. An uncertainty estimation network dynamically assesses the reliability of these modality-level decisions, which are transformed into Dirichlet distributions to model the uncertainty. Finally, trusted predictions are obtained by integrating the adjusted distributions and their uncertainty estimates according to Dempster–Shafer evidence theory. Experimental results on a public micro-video dataset show that the proposed model outperforms state-of-the-art methods, and further studies validate the effectiveness and robustness brought by the uncertainty estimation.
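
To make the fusion step concrete, below is a minimal, illustrative sketch of the reduced Dempster–Shafer combination of Dirichlet-based evidence of the kind the abstract describes. It assumes each modality head already outputs a non-negative per-class evidence vector; the function names (dirichlet_from_evidence, ds_combine, fuse_modalities) and the example evidence values are hypothetical and are not taken from the AETML implementation.

```python
import numpy as np

def dirichlet_from_evidence(evidence):
    """Turn non-negative per-class evidence into Dirichlet parameters,
    belief masses, and an uncertainty mass for one modality."""
    alpha = evidence + 1.0                    # Dirichlet parameters: alpha_k = e_k + 1
    strength = alpha.sum()                    # Dirichlet strength: S = sum_k alpha_k
    belief = evidence / strength              # belief mass: b_k = e_k / S
    uncertainty = len(evidence) / strength    # uncertainty mass: u = K / S
    return alpha, belief, uncertainty

def ds_combine(b1, u1, b2, u2):
    """Reduced Dempster-Shafer combination of two modalities' belief masses."""
    conflict = np.sum(np.outer(b1, b2)) - np.sum(b1 * b2)  # C = sum_{i != j} b1_i * b2_j
    scale = 1.0 / (1.0 - conflict)
    belief = scale * (b1 * b2 + b1 * u2 + b2 * u1)
    uncertainty = scale * u1 * u2
    return belief, uncertainty

def fuse_modalities(evidences):
    """Fuse a list of per-modality evidence vectors into one trusted prediction."""
    _, belief, uncertainty = dirichlet_from_evidence(evidences[0])
    for e in evidences[1:]:
        _, b, u = dirichlet_from_evidence(e)
        belief, uncertainty = ds_combine(belief, uncertainty, b, u)
    num_classes = len(evidences[0])
    strength = num_classes / uncertainty      # recover S from u = K / S
    alpha = belief * strength + 1.0           # combined Dirichlet parameters
    prob = alpha / alpha.sum()                # expected class probabilities
    return prob, uncertainty

if __name__ == "__main__":
    # Hypothetical per-class evidence from visual, acoustic, and textual heads.
    visual   = np.array([9.0, 1.0, 0.5])
    acoustic = np.array([4.0, 3.0, 0.2])
    textual  = np.array([0.3, 0.2, 0.1])      # weak evidence -> high uncertainty
    prob, u = fuse_modalities([visual, acoustic, textual])
    print("fused class probabilities:", np.round(prob, 3))
    print("fused uncertainty mass:", round(u, 3))
```

In this scheme, a modality that produces little evidence (such as the textual head in the example) carries a large uncertainty mass and is therefore down-weighted in the fused decision, which is the behaviour the uncertainty estimation in the abstract is intended to provide.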
