Abstract

In this paper, we propose the Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multimodal understanding and generation. Unlike widely studied vision-language pretraining models, VALOR jointly models the relationships among vision, audio, and language in an end-to-end manner. It consists of three separate encoders for single-modality representations and a decoder for multimodal conditional text generation. We design two pretext tasks to pretrain the VALOR model: Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects vision, language, and audio into the same common space, simultaneously building vision-language, audio-language, and audiovisual-language alignment. MGC learns to generate text tokens conditioned on vision, audio, or both. To promote vision-audio-language pretraining research, we construct a large-scale, high-quality tri-modality dataset named VALOR-1M, containing 1 million audible videos with human-annotated audiovisual captions. Extensive experiments show that VALOR can learn strong multimodal correlations and generalize to various downstream tasks (e.g., retrieval, captioning, and question answering) with different input modalities (e.g., vision-language, audio-language, and audiovisual-language). VALOR achieves new state-of-the-art performance on a series of public cross-modality benchmarks. Code and data are available on the project page at https://casia-iva-group.github.io/projects/VALOR.
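
To make the MGA objective concrete, below is a minimal PyTorch sketch of grouped contrastive alignment, assuming a symmetric InfoNCE loss and mean-pooled fusion for the audiovisual group. The embedding dimension, batch shapes, temperature, and fusion scheme are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def info_nce(x, y, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    Matched pairs sit on the diagonal of the similarity matrix; all other
    in-batch pairs serve as negatives.
    """
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature
    targets = torch.arange(x.size(0), device=x.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def mga_loss(vision, audio, text, temperature=0.07):
    """Multimodal Grouping Alignment (sketch): align text with the vision
    group, the audio group, and a fused audiovisual group in one shared
    space. The mean fusion below is an assumption; the paper's fusion of
    the audiovisual group may differ.
    """
    av = 0.5 * (F.normalize(vision, dim=-1) + F.normalize(audio, dim=-1))
    return (info_nce(vision, text, temperature) +
            info_nce(audio, text, temperature) +
            info_nce(av, text, temperature)) / 3.0


# Toy usage: a batch of 8 clips with hypothetical 256-d encoder outputs.
vision = torch.randn(8, 256)  # vision encoder output
audio = torch.randn(8, 256)   # audio encoder output
text = torch.randn(8, 256)    # text encoder output
print(mga_loss(vision, audio, text).item())
```

MGC would complement this objective with an autoregressive captioning loss in which the decoder generates text conditioned on the vision group, the audio group, or both.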
