Learning to Answer Visual Questions from Web Videos.

Antoine Yang, Ivan Laptev, Antoine Miech, Cordelia Schmid, Josef Sivic

doi:10.1109/tpami.2022.3173208

Abstract

Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious and expensive, and it prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering using automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and the VideoQA feature probe evaluation setting and show excellent results. Furthermore, our method achieves competitive results on the MSRVTT-QA, ActivityNet-QA, MSVD-QA and How2QA datasets. We also show that our approach generalizes to another source of web video and text data. We generate the WebVidVQA3M dataset from videos with alt-text annotations, and show its benefits for training VideoQA models. Finally, for a detailed evaluation, we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations.
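The contrastive training idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss, so the code below assumes a common softmax (InfoNCE-style) formulation in which each video-question embedding is scored against every answer embedding in the batch by dot product, with the matching answer as the positive. The function names `contrastive_loss` and `rank_answers`, the toy 2-dimensional embeddings and the answer vocabulary are all hypothetical and stand in for the outputs of the two transformers.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(vq_embeddings, answer_embeddings):
    """Mean softmax cross-entropy over in-batch answers (assumed formulation).

    Row i of vq_embeddings is paired with row i of answer_embeddings;
    all other answers in the batch act as negatives.
    """
    total = 0.0
    for i, vq in enumerate(vq_embeddings):
        scores = [dot(vq, a) for a in answer_embeddings]
        # Numerically stable log-sum-exp over all candidate answers.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[i]
    return total / len(vq_embeddings)


def rank_answers(vq, answer_embeddings, vocabulary):
    """Return the vocabulary entry whose embedding scores highest against vq."""
    scores = [dot(vq, a) for a in answer_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return vocabulary[best]


# Toy data (hypothetical): two video-question embeddings and two answers.
vq_batch = [[4.0, 0.0], [0.0, 4.0]]
answers = [[1.0, 0.0], [0.0, 1.0]]
vocabulary = ["pan", "knife"]

aligned = contrastive_loss(vq_batch, answers)
swapped = contrastive_loss(vq_batch, answers[::-1])
prediction = rank_answers(vq_batch[0], answers, vocabulary)
```

Because answers are compared in an embedding space rather than through a fixed classification layer, the same scoring function can rank any candidate answer at test time, which is one way to read the abstract's remark about handling an open vocabulary.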

Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Publication Date: Jan 1, 2024
Citations: 8
License type: other-oa

Similar Papers

Just Ask: Learning to Answer Questions from Millions of Narrated Videos
Antoine Yang ... Ivan Laptev
01 Oct 2021

Fusing video and text data by integrating appearance and behavior similarity
Georgiy Levchuk ... Charlotte Shabarekh
28 May 2013

Text Mining A Decade Of Focal Development Trends In An African Country
Opoku-Mensah Nelson ... Danso Juliana Mantebea
17 Dec 2021

HAIC-NET: Semi-supervised OCTA vessel segmentation with self-supervised pretext task and dual consistency training
Hailan Shen ... Zailiang Chen
Pattern Recognition | VOL. 151
15 Mar 2024
