TEVL: Trilinear Encoder for Video-language Representation Learning

Xin Man, Feiyu Chen, Mingxing Zhang, Heng Tao Shen, Jie Shao

doi:10.1145/3585388

Abstract

Pre-training a model on large-scale unlabeled web videos and then fine-tuning it for specific tasks is a canonical approach to learning video and language representations. However, the Automatic Speech Recognition (ASR) transcripts that accompany these videos are transcribed directly from audio, so they may be inconsistent with the visual information and can impair the model's language modeling ability. Meanwhile, previous video-language (V-L) models fuse visual and language features using single- or dual-stream architectures, which are not suitable for this setting. In addition, traditional V-L research focuses mainly on the interaction between the vision and language modalities and leaves relationships within each modality unmodeled. To address these issues while keeping manual labor cost low, we add automatically extracted dense captions as supplementary text and propose a new trilinear video-language interaction framework, TEVL (Trilinear Encoder for Video-Language representation learning). TEVL contains three unimodal encoders, a TRIlinear encOder (TRIO) block, and a temporal Transformer. TRIO is specially designed to support effective text-vision-text interaction, encouraging inter-modal cooperation while maintaining intra-modal dependencies. We pre-train TEVL on the HowTo100M and TV datasets with four task objectives. Experimental results demonstrate that TEVL learns powerful video-text representations and achieves competitive performance on three downstream tasks: multimodal video captioning, video Question Answering (QA), and video-and-language inference. Implementation code is available at https://github.com/Gufrannn/TEVL.
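The abstract does not describe the internals of TRIO, so the following is only a toy sketch of the general idea of a three-stream text-vision-text interaction, not the authors' design. It assumes three already-encoded sequences (ASR transcript, video frames, dense captions) of equal feature width, and lets each stream attend to itself (intra-modal) and to the other two streams (inter-modal). The function names `attend` and `three_stream_step`, the absence of learned projections, and the simple averaging are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Scaled dot-product attention with no learned projections (assumption)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        outputs.append(
            [sum(w * m[i] for w, m in zip(weights, memory)) for i in range(dim)]
        )
    return outputs


def three_stream_step(transcript, video, captions):
    """One hypothetical interaction step over the three streams."""

    def update(own, other_a, other_b):
        intra = attend(own, own)                # dependencies within the modality
        inter = attend(own, other_a + other_b)  # cooperation across modalities
        # Average the input, intra-modal, and inter-modal views of each position.
        return [
            [(x + y + z) / 3 for x, y, z in zip(o, i, c)]
            for o, i, c in zip(own, intra, inter)
        ]

    return (
        update(transcript, video, captions),
        update(video, transcript, captions),
        update(captions, transcript, video),
    )


def random_sequence(length, dim, rng):
    """Stand-in for the output of a unimodal encoder (illustrative only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(length)]


rng = random.Random(0)
transcript = random_sequence(5, 8, rng)  # 5 transcript tokens
video = random_sequence(4, 8, rng)       # 4 video frames
captions = random_sequence(3, 8, rng)    # 3 dense-caption tokens
t_out, v_out, c_out = three_stream_step(transcript, video, captions)
```

Each stream keeps its own length and feature width after the step, so in a fuller model the video stream could be passed on to a temporal module; that wiring is likewise an assumption here.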

Journal: ACM Transactions on Multimedia Computing, Communications, and Applications
Publication Date: Jun 7, 2023
Citations: 3

Similar Papers

Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs
Sujeong Cha ... Samuel Thomas
30 Aug 2021

Video question answering via grounded cross-attention network learning
Yunan Ye ... Jun Xiao
Information Processing & Management | VOL. 57
16 Apr 2020

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Linjie Li ... Jingjing Liu
01 Jan 2020

DUAL: Discrete Spoken Unit Adaptive Learning for Textless Spoken Question Answering
Guan-Ting Lin ... Shu-Wen Yang
18 Sep 2022
