Abstract

Video captioning is a challenging computer vision task that automatically describes video clips in natural language sentences, requiring a clear understanding of the embedded semantics. In this work, a video caption generation framework is proposed that combines a discrete wavelet convolutional neural architecture with multimodal feature attention. Global, contextual, and temporal features of the video frames are taken into account, and separate attention networks are integrated into the visual attention predictor network to capture multiple attentions over these features. The attended features, together with textual attention, are fed to the visual-to-text translator for caption generation. Experiments are conducted on two benchmark video captioning datasets, MSVD and MSR-VTT. The results demonstrate improved performance, with CIDEr scores of 91.7 and 52.2 on the respective datasets.
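To illustrate the multimodal attention scheme the abstract describes, the following is a minimal sketch (not the authors' implementation; all module names, feature dimensions, and the additive-attention form are assumptions) of separate attention networks over global, contextual, and temporal frame features whose attended outputs are fused before a visual-to-text decoder.

```python
# Hypothetical sketch of per-modality visual attention, assuming additive
# attention over frame-level features; not taken from the paper's code.
import torch
import torch.nn as nn


class FeatureAttention(nn.Module):
    """Additive attention over one feature stream of shape (batch, frames, dim)."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, feats):
        weights = torch.softmax(self.score(feats), dim=1)   # attention over frames
        return (weights * feats).sum(dim=1)                 # attended vector (batch, dim)


class VisualAttentionPredictor(nn.Module):
    """One attention network per modality; attended vectors are concatenated."""
    def __init__(self, dims, hidden_dim=512):
        super().__init__()
        self.attn = nn.ModuleList(FeatureAttention(d, hidden_dim) for d in dims)

    def forward(self, global_f, contextual_f, temporal_f):
        streams = (global_f, contextual_f, temporal_f)
        return torch.cat([a(f) for a, f in zip(self.attn, streams)], dim=-1)


# Usage with dummy features: a batch of 2 clips, 16 frames each.
predictor = VisualAttentionPredictor(dims=(2048, 1024, 512))
fused = predictor(torch.randn(2, 16, 2048),
                  torch.randn(2, 16, 1024),
                  torch.randn(2, 16, 512))
print(fused.shape)  # torch.Size([2, 3584]) -> input to the visual-to-text decoder
```

In this reading, keeping one attention module per feature stream lets each modality learn its own frame weighting before fusion; the exact fusion and decoder details would follow the paper's full text.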
