Do you see what I see?

Cordelia Schmid

doi:10.1145/3474085.3476967

Abstract

In this talk we present recent progress on large-scale learning of multimodal video representations. We start by presenting VideoBERT, a joint model for video and language that repurposes the BERT model for multimodal data. This model achieves state-of-the-art results on zero-shot prediction and video captioning. Next, we show how to extend learning from instructional videos to general movies based on cross-modal supervision. We use movie screenplays to learn speech-to-action classifiers and use these classifiers to mine video clips from thousands of hours of movies. We demonstrate performance comparable to or better than fully supervised approaches for action classification. Next, we present an approach for video question answering that relies on training from instructional videos and cross-modal supervision with a textual question-answer module. We show state-of-the-art results for video question answering without any supervision (zero-shot VQA) and demonstrate that our approach obtains competitive results when pre-trained and then fine-tuned on video question answering datasets. We conclude the talk by presenting a recent video feature that is fully transformer-based. Our Video Vision Transformer (ViViT) is shown to outperform the state of the art on video classification. Furthermore, it is flexible and allows for performance/accuracy trade-offs based on several different architectures.

Similar Papers

Hierarchical Representation Network With Auxiliary Tasks for Video Captioning and Video Question Answering
Lianli Gao ... Meng Wang
IEEE Transactions on Image Processing | VOL. 31
01 Jan 2021

DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization
Zineng Tang ... Jie Lei
01 Jan 2020

DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization
25 May 2021

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
Kevin Lin ... Chung-Ching Lin
01 Jun 2022
