Abstract

Video captioning aims to generate a grammatically correct and accurate sentence that describes a video. Recent methods have mainly tackled this problem by considering multiple modalities, yet they have neglected the differences among modalities and the importance of narrowing the gap between video and text. This paper proposes a multi-task video-captioning method with a Stepwise Multimodal Encoder. The encoder can flexibly digest multiple modalities by assigning a proper encoding depth to each modality. We also exploit both the video-to-text (V2T) and text-to-video (T2V) flows by adding an auxiliary task of video–text semantic matching. Our method achieves state-of-the-art performance on two widely used datasets, MSVD and MSR-VTT: on MSVD it improves CIDEr by 18%, and on MSR-VTT by 6%.
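The abstract does not specify the encoder's architecture or the matching objective, so the following is only a minimal PyTorch sketch of what such a design could look like, assuming a shared stack of Transformer layers with a per-modality encoding depth and an InfoNCE-style symmetric matching loss covering both the V2T and T2V directions. The class and function names, the chosen depths, and the loss formulation are illustrative assumptions, not the paper's actual method.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StepwiseMultimodalEncoder(nn.Module):
    """Hypothetical sketch: each modality (e.g., appearance, motion,
    object features) is routed through a different number of shared
    Transformer layers, so each modality gets its own encoding depth."""
    def __init__(self, d_model=512, max_depth=4, depths=(2, 3, 4)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(max_depth)
        )
        self.depths = depths  # assumed encoding depth per modality

    def forward(self, modalities):
        # modalities: list of tensors, each of shape (batch, seq_len, d_model)
        encoded = []
        for feats, depth in zip(modalities, self.depths):
            h = feats
            for layer in self.layers[:depth]:
                h = layer(h)
            encoded.append(h)
        # Fuse by concatenating along the sequence dimension.
        return torch.cat(encoded, dim=1)

def matching_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over pooled video/text embeddings, exploiting
    both the video-to-text and text-to-video directions (an assumption
    about how the semantic-matching auxiliary task could be realized)."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                      # (batch, batch)
    targets = torch.arange(v.size(0), device=v.device)    # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +      # V2T direction
                  F.cross_entropy(logits.t(), targets))   # T2V direction

In a multi-task setup like the one described, this matching term would typically be added to the standard captioning cross-entropy with a weighting coefficient; the abstract does not state how the two objectives are balanced.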
