MFVC: Urban Traffic Scene Video Caption Based on Multimodal Fusion

Mingxing Li, Hongzhe Liu, Cheng Xu, Xuewei Li, Chenyang Yan, Hao Zhang

doi:10.3390/electronics11192999

Open Access

https://doi.org/10.3390/electronics11192999

Journal: Electronics
Publication Date: Sep 21, 2022
Citations: 1
License type: CC BY 4.0

Affiliation: Beijing Union University

Abstract

As electronic technology advances, intelligent cars can run increasingly complex artificial intelligence algorithms, and video captioning is one of them. However, when applied to urban traffic scenes, current video caption algorithms consider only visual information, so they fail to generate accurate captions for complex scenes. Transformer-based multimodal fusion is one solution to this problem, but existing fusion algorithms suffer from low fusion performance and high computational complexity. To address these issues, we propose MFVC (Multimodal Fusion for Video Caption), a new Transformer-based video caption model. We introduce audio modal data to increase the information available to the caption generation model, and an attention bottleneck module to improve model performance at a lower computational cost. Experiments are conducted on the public datasets MSR-VTT and MSVD. To verify the model's effectiveness on urban traffic scenes, experiments are also carried out on the self-built traffic caption dataset BUUISE, where the evaluation metrics confirm the model's effectiveness. The model achieves good results on both the public datasets and the urban traffic dataset and has excellent application prospects in the intelligent driving industry.
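
The abstract does not specify how the attention bottleneck module is built, so the following is only a minimal sketch of one common formulation of bottleneck fusion, not the MFVC implementation. It assumes that video and audio tokens never attend to each other directly: each modality attends over its own tokens plus a small set of shared bottleneck tokens, and the two resulting bottleneck copies are averaged. All names (`bottleneck_fusion_layer`, `attention_pairs`), the identity projections, and the token counts are hypothetical choices made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), which keeps the sketch free of learned weights.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, tokens))
             for d in range(dim)]
        )
    return outputs


def bottleneck_fusion_layer(video, audio, bottleneck):
    """One fusion layer: modalities exchange information only via bottleneck tokens."""
    video_out = self_attention(video + bottleneck)
    audio_out = self_attention(audio + bottleneck)
    new_video, video_bn = video_out[:len(video)], video_out[len(video):]
    new_audio, audio_bn = audio_out[:len(audio)], audio_out[len(audio):]
    # Average the two modality-specific updates of the shared bottleneck.
    new_bottleneck = [
        [(v + a) / 2 for v, a in zip(v_tok, a_tok)]
        for v_tok, a_tok in zip(video_bn, audio_bn)
    ]
    return new_video, new_audio, new_bottleneck


def attention_pairs(n_video, n_audio, n_bottleneck):
    """Query-key score counts: full concatenated attention vs. bottleneck fusion."""
    full = (n_video + n_audio) ** 2
    fused = (n_video + n_bottleneck) ** 2 + (n_audio + n_bottleneck) ** 2
    return full, fused


if __name__ == "__main__":
    rng = random.Random(0)

    def make_tokens(count, dim=8):
        return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(count)]

    video, audio, bottleneck = bottleneck_fusion_layer(
        make_tokens(5), make_tokens(3), make_tokens(2)
    )
    print(len(video), len(audio), len(bottleneck))
    print(attention_pairs(32, 16, 4))
```

Under these assumptions, the saving comes from the score count: with 32 video tokens, 16 audio tokens, and 4 bottleneck tokens, full attention over the concatenated sequence computes 48² = 2304 query-key scores, while the two bottlenecked attentions compute 36² + 20² = 1696. The gap widens as the two sequences grow and the bottleneck stays small.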

Full Text