Abstract

The video captioning problem consists of describing a short video clip with natural language. Existing solutions tend to rely on extracting features from frames or sets of frames with pretrained and fixed Convolutional Neural Networks (CNNs). Traditionally, the CNNs are pretrained on the ImageNet-1K (IN1K) classification task. The features are then fed into a sequence-to-sequence model to produce the text description output. In this paper, we propose using Facebook's ResNeXt Weakly Supervised Learning (WSL) CNNs as fixed feature extractors for video captioning. These CNNs are trained on billion-scale weakly supervised datasets constructed from Instagram image-hashtag pairs and then fine-tuned on IN1K. Whereas previous works use complicated architectures or multimodal features, we demonstrate state-of-the-art performance on the Microsoft Video Description (MSVD) dataset and competitive results on the Microsoft Research-Video to Text (MSR-VTT) dataset using only the frame-level features from the new CNNs and a basic Transformer as a sequence-to-sequence model. Moreover, our results validate that CNNs pretrained with weak supervision can effectively transfer to tasks other than classification. Finally, we present results for a number of IN1K feature extractors and discuss the relationship between IN1K accuracy and video captioning performance. Code will be made available at https://github.com/flauted/OpenNMT-py.
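As a rough sketch of the pipeline the abstract describes (frame-level features from a fixed WSL ResNeXt fed into a basic Transformer sequence-to-sequence model), the following PyTorch snippet loads one of Facebook's publicly released ResNeXt WSL models via torch.hub and wires its features into a small encoder-decoder Transformer. The specific backbone variant, preprocessing, and Transformer hyperparameters here are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn
import torchvision.transforms as T

# Load a ResNeXt WSL model released by Facebook via torch.hub.
# (The paper's exact backbone variant and preprocessing are assumptions here.)
backbone = torch.hub.load('facebookresearch/WSL-Images', 'resnext101_32x8d_wsl')
backbone.fc = nn.Identity()   # drop the IN1K classification head
backbone.eval()               # fixed feature extractor: no fine-tuning

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_frame_features(frames):
    """frames: list of PIL images sampled from one clip -> (num_frames, 2048)."""
    batch = torch.stack([preprocess(f) for f in frames])
    return backbone(batch)

# A basic encoder-decoder Transformer over the frame features; d_model, layer
# counts, and vocabulary handling are illustrative, and positional encodings
# are omitted for brevity.
class CaptioningTransformer(nn.Module):
    def __init__(self, feat_dim=2048, d_model=512, vocab_size=10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats, caption_tokens):
        src = self.proj(frame_feats)        # (B, num_frames, d_model)
        tgt = self.embed(caption_tokens)    # (B, cap_len, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        dec = self.transformer(src, tgt, tgt_mask=mask)
        return self.out(dec)                # (B, cap_len, vocab_size)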
