Abstract

In this talk we present recent progress on large-scale learning of multimodal video representations. We start by presenting VideoBERT, a joint model for video and language that repurposes the BERT model for multimodal data. This model achieves state-of-the-art results on zero-shot prediction and video captioning. Next we show how to extend learning from instructional videos to general movies based on cross-modal supervision. We use movie screenplays to learn speech-to-action classifiers and use these classifiers to mine video clips from thousands of hours of movies. We demonstrate performance comparable to or better than fully supervised approaches for action classification. Next we present an approach for video question answering which relies on training from instructional videos and cross-modal supervision with a textual question-answering module. We show state-of-the-art results for video question answering without any supervision (zero-shot VQA) and demonstrate that our approach obtains competitive results when pre-training and then fine-tuning on video question answering datasets. We conclude our talk by presenting a recent video model that is fully transformer-based. Our Video Vision Transformer (ViViT) is shown to outperform the state of the art on video classification. Furthermore, it is flexible and offers efficiency/accuracy trade-offs through several different architectural variants.
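To make the VideoBERT idea more concrete: video is quantized into discrete "visual tokens" (e.g. by clustering pretrained 3D-CNN clip features), which are concatenated with word tokens so that a standard BERT encoder and masked-token objective can be applied to the joint sequence. The PyTorch sketch below is purely illustrative; the class name, vocabulary sizes, and hyperparameters are our own assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class JointVideoTextBERT(nn.Module):
    """Illustrative BERT-style encoder over a joint text + visual-token sequence.
    Visual tokens are assumed to come from clustering clip-level video features
    (e.g. k-means over pretrained 3D-CNN features), in the spirit of VideoBERT."""
    def __init__(self, text_vocab=30522, visual_vocab=20736, dim=768,
                 layers=12, heads=12, max_len=512):
        super().__init__()
        # One shared embedding table: text ids come first, visual ids are offset.
        self.embed = nn.Embedding(text_vocab + visual_vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        self.segment = nn.Embedding(2, dim)  # 0 = text, 1 = video
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        # Masked-token prediction head over the joint text + visual vocabulary.
        self.mlm_head = nn.Linear(dim, text_vocab + visual_vocab)
        self.text_vocab = text_vocab

    def forward(self, text_ids, visual_ids):
        # Concatenate text tokens and (offset) visual tokens into one sequence.
        tokens = torch.cat([text_ids, visual_ids + self.text_vocab], dim=1)
        seg = torch.cat([torch.zeros_like(text_ids),
                         torch.ones_like(visual_ids)], dim=1)
        pos = torch.arange(tokens.size(1), device=tokens.device).unsqueeze(0)
        x = self.embed(tokens) + self.pos(pos) + self.segment(seg)
        h = self.encoder(x)
        return self.mlm_head(h)

# Toy forward pass: 8 text tokens and 12 visual tokens per example.
model = JointVideoTextBERT(layers=2)  # shallow model, for illustration only
text_ids = torch.randint(0, 30522, (2, 8))
visual_ids = torch.randint(0, 20736, (2, 12))
logits = model(text_ids, visual_ids)
print(logits.shape)  # torch.Size([2, 20, 51258])
```

The same joint-sequence view is what enables the zero-shot uses mentioned above: once text and video share one token space, masked prediction over that space can be queried with text-only or video-only context.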

