Abstract

Video description has become a research hotspot in recent years because of its wide application value. Visual features alone cannot reliably guide the generation of accurate video descriptions, which leads to mismatches between the generated text and the video content. To address this problem, a video description text generation algorithm combining visual and voice features is proposed, which improves the accuracy of the generated descriptions by fusing the two modalities. First, a vision transformer model is used to extract the visual feature vectors, and Mel-Frequency Cepstral Coefficients (MFCC) are used to extract the audio feature vectors. The two feature vectors are concatenated and then average-pooled to obtain global feature information. Second, the processed feature information is fed into a transformer encoder. Finally, the encoded result is passed to a transformer decoder, which generates the video description text. The transformer framework contains a multi-head self-attention mechanism that captures temporal information while focusing on the most relevant video features, making the generated descriptions more accurate. The proposed method has been evaluated on the public MSRVTT and MSVD datasets and achieves good results on four different evaluation metrics.
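The pipeline described above (ViT visual features and MFCC audio features, concatenation, average pooling, then a transformer encoder-decoder) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the layer sizes, feature dimensions, use of torch.nn.Transformer, and the choice to prepend the pooled global vector to the fused sequence are all assumptions.

```python
# Illustrative sketch of visual+voice feature fusion for caption generation.
# All hyperparameters below (d_visual, d_audio, d_model, layer counts) are
# placeholders, not values reported in the paper.
import torch
import torch.nn as nn


class VisualVoiceCaptioner(nn.Module):
    def __init__(self, d_visual=768, d_audio=40, d_model=512, vocab_size=10000):
        super().__init__()
        # Project pre-extracted ViT frame features and MFCC frames into a
        # shared embedding space before fusion.
        self.proj_visual = nn.Linear(d_visual, d_model)
        self.proj_audio = nn.Linear(d_audio, d_model)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Encoder-decoder with multi-head self-attention, as in the
        # transformer framework the abstract describes.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, audio_feats, caption_tokens):
        # visual_feats: (B, T_v, d_visual) frame features from a vision transformer
        # audio_feats:  (B, T_a, d_audio)  MFCC frames
        v = self.proj_visual(visual_feats)
        a = self.proj_audio(audio_feats)
        fused = torch.cat([v, a], dim=1)          # splice along the time axis
        pooled = fused.mean(dim=1, keepdim=True)  # average pooling -> global feature
        # Assumption: feed the global vector together with the fused sequence.
        memory_in = torch.cat([pooled, fused], dim=1)
        tgt = self.token_embed(caption_tokens)    # (B, T_txt, d_model)
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(tgt.device)           # causal mask for the decoder
        dec = self.transformer(memory_in, tgt, tgt_mask=tgt_mask)
        return self.out(dec)                      # per-token vocabulary logits
```

At inference time the description would be generated autoregressively, feeding the decoder its own previously emitted tokens; positional encodings, which a working system would also need, are omitted here for brevity.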
