Abstract
Video captioning is a challenging task that aims to generate rich natural language descriptions of video content, and it has become a promising direction in artificial intelligence research. However, most existing methods overlook two problems caused by the limitations of their frame sampling strategies: redundant visual information and omitted scene information. To address these problems, a semantic guidance network for video captioning is proposed. Specifically, a novel scene frame sampling strategy is first proposed to select key scene frames. Then, a vision transformer encoder is applied to learn visual and semantic information with a global view, alleviating the information loss in the encoder's hidden layer that arises when modeling long-range dependencies. Finally, a non-parametric metric learning module is introduced to calculate the similarity between the ground truth and the prediction result, and the model is optimized in an end-to-end manner. Experiments on the benchmark MSR-VTT and MSVD datasets show that the proposed method can effectively improve description accuracy and generalization ability.
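The abstract does not specify how key scene frames are selected, so the following is only an illustrative sketch of one common way such sampling could work, not the authors' method. It assumes each frame has already been reduced to a feature vector (for example, a normalized color histogram), starts a new scene whenever consecutive frames differ by more than a threshold, and keeps the middle frame of each scene. The names `l1_distance`, `sample_scene_frames`, and `threshold` are hypothetical.

```python
from typing import List, Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def sample_scene_frames(
    features: Sequence[Sequence[float]], threshold: float
) -> List[int]:
    """Return one representative frame index per detected scene.

    Assumption (not from the paper): a scene boundary is declared when the
    L1 distance between consecutive frame features exceeds `threshold`,
    and the middle frame of each scene is used as its key frame.
    """
    if not features:
        return []

    keyframes: List[int] = []
    scene_start = 0
    for i in range(1, len(features)):
        if l1_distance(features[i - 1], features[i]) > threshold:
            # Close the scene spanning [scene_start, i - 1] and keep its middle frame.
            keyframes.append((scene_start + i - 1) // 2)
            scene_start = i

    # Close the final scene, which runs to the last frame.
    keyframes.append((scene_start + len(features) - 1) // 2)
    return keyframes


if __name__ == "__main__":
    # Three frames of one scene followed by two frames of another.
    toy_features = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 2
    print(sample_scene_frames(toy_features, threshold=1.0))  # [1, 3]
```

Sampling one frame per scene, rather than at a fixed stride, is one way to reduce redundant frames within a long scene while still covering short scenes that uniform sampling might skip.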