Abstract

Describing video content in natural language is an important part of video understanding. It requires not only understanding the spatial information in a video but also capturing its motion information. At the same time, video captioning is a cross-modal problem between vision and language. Traditional video captioning methods follow an encoder-decoder framework that translates the video into a sentence, but the semantic alignment from sentence back to video is ignored. Hence, finding a discriminative visual representation and narrowing the semantic gap between video and text strongly influence the accuracy of the generated sentences. In this paper, we propose an approach based on a multi-feature fusion refine network (MFRN), which not only captures spatial and motion information through multi-feature fusion, but also achieves better semantic alignment across modalities by designing a refiner that exploits the sentence-to-video direction. The main novelties and advantages of our method are: (1) Multi-feature fusion: two-dimensional and three-dimensional convolutional neural networks, pre-trained on ImageNet and Kinetics respectively, are used to extract spatial and motion information, which are then fused into a better visual representation. (2) Semantic alignment refiner: the refiner is designed to constrain the decoder and reproduce the video features, narrowing the semantic gap between the two modalities. Experiments on two widely used datasets demonstrate that our approach achieves state-of-the-art performance in terms of BLEU@4, METEOR, ROUGE and CIDEr metrics.
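To make the two components concrete, the sketch below shows one way the described pipeline could be wired up in PyTorch: appearance features from a 2D CNN and motion features from a 3D CNN are projected and concatenated into a fused visual representation, and a refiner reconstructs a global video descriptor from the decoder's hidden states so a reconstruction loss can constrain training in the sentence-to-video direction. All names, feature dimensions, and the choice of a GRU refiner with an MSE loss are illustrative assumptions, not the authors' MFRN implementation.

```python
import torch
import torch.nn as nn


class MultiFeatureFusion(nn.Module):
    """Fuse per-frame 2D-CNN features with clip-level 3D-CNN features
    into one visual representation (hypothetical sketch)."""

    def __init__(self, dim_2d=2048, dim_3d=1024, dim_fused=512):
        super().__init__()
        self.proj_2d = nn.Linear(dim_2d, dim_fused)
        self.proj_3d = nn.Linear(dim_3d, dim_fused)

    def forward(self, feat_2d, feat_3d):
        # feat_2d: (batch, n_frames, dim_2d) appearance features (e.g. ImageNet-pretrained 2D CNN)
        # feat_3d: (batch, n_clips,  dim_3d) motion features (e.g. Kinetics-pretrained 3D CNN)
        fused = torch.cat([self.proj_2d(feat_2d), self.proj_3d(feat_3d)], dim=1)
        return fused  # (batch, n_frames + n_clips, dim_fused)


class Refiner(nn.Module):
    """Reconstruct a video descriptor from decoder hidden states so the
    sentence-to-video direction constrains training (hypothetical sketch)."""

    def __init__(self, dim_hidden=512, dim_visual=512):
        super().__init__()
        self.rnn = nn.GRU(dim_hidden, dim_visual, batch_first=True)

    def forward(self, decoder_states):
        # decoder_states: (batch, n_words, dim_hidden) from the caption decoder
        reconstructed, _ = self.rnn(decoder_states)
        # Mean-pool over words to obtain a global reconstructed video descriptor.
        return reconstructed.mean(dim=1)  # (batch, dim_visual)


def refine_loss(reconstructed, fused_visual):
    """Reconstruction loss pushing the decoder's states to reproduce the video features."""
    target = fused_visual.mean(dim=1)  # global descriptor of the fused visual features
    return nn.functional.mse_loss(reconstructed, target)


if __name__ == "__main__":
    # Dummy tensors to show the expected shapes.
    feat_2d = torch.randn(2, 8, 2048)      # 8 frames of appearance features
    feat_3d = torch.randn(2, 4, 1024)      # 4 clips of motion features
    dec_states = torch.randn(2, 12, 512)   # decoder hidden states for a 12-word caption

    fusion, refiner = MultiFeatureFusion(), Refiner()
    fused = fusion(feat_2d, feat_3d)                   # (2, 12, 512)
    loss = refine_loss(refiner(dec_states), fused)
    print(fused.shape, loss.item())
```

In this sketch the refiner's output is compared only against a mean-pooled video descriptor; the actual paper may reconstruct the features differently or weight this term against the captioning loss in another way.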
