i-Code: An Integrative and Composable Multimodal Learning Framework

Ziyi Yang, Liyang Lu, Xuedong Huang, Michael Zeng, Yujia Xie, Robert Gmyr, Noel Codella, Dongdong Chen, Reid Pryzant, Yuwei Fang, Bin Xiao, Naoyuki Kanda, Takuya Yoshioka, Mengchun Gao, Zhu C, Yi‐Ling Chen, Shuang Yu, Yuan Liu, Xu Yi‐Chong, Qian Ye

doi:10.1609/aaai.v37i9.26290

Abstract

Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities. We present i-Code, a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations. In this framework, data from each modality are first given to pretrained single-modality encoders. The encoder outputs are then integrated with a multimodal fusion network, which uses novel merge- and co-attention mechanisms to effectively combine information from the different modalities. The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning. Unlike previous research using only video for pretraining, the i-Code framework can dynamically process single, dual, and triple-modality data during training and inference, flexibly projecting different combinations of modalities into a single representation space. Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five multimodal understanding tasks and single-modality benchmarks, improving by as much as 11% and demonstrating the power of integrative multimodal pretraining.
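
As an illustration only of the cross-modality contrastive learning objective mentioned in the abstract, the following is a minimal sketch, assuming a symmetric InfoNCE-style loss over a batch of paired embeddings from two modalities. The function names, the temperature value, and the use of NumPy are assumptions made for this example; the abstract does not specify the exact form of the loss, so this should not be read as the paper's implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed form, not taken from the paper).

    emb_a, emb_b: arrays of shape (batch, dim); row i of each array is assumed
    to come from the same underlying sample seen through two modalities.
    """
    a = l2_normalize(emb_a)
    b = l2_normalize(emb_b)
    logits = a @ b.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(a))                  # matching pairs lie on the diagonal
    loss_ab = -log_softmax(logits)[idx, idx].mean()    # modality A -> B
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()  # modality B -> A
    return 0.5 * (loss_ab + loss_ba)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.normal(size=(8, 16))        # stand-in for speech embeddings
    aligned = contrastive_loss(speech, speech)
    misaligned = contrastive_loss(speech, np.roll(speech, 1, axis=0))
    print(aligned < misaligned)
```

In this sketch, correctly paired rows yield a lower loss than rows shifted out of alignment, which is the behavior a contrastive objective is meant to reward.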

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Jun 26, 2023
Citations: 9

Similar Papers

MFF-Net: Multimodal Feature Fusion Network for 3D Object Detection
Peicheng Shi ... Heng Qi
Computers, Materials & Continua | VOL. 75
01 Jan 2023

Citrus Huanglongbing Detection Based on Multi-Modal Feature Fusion Learning.
Dongzi Yang ... Xiaoling Deng
Frontiers in Plant Science | VOL. 12
23 Dec 2021

Semantic‐enhanced multimodal fusion network for fake news detection
Shuo Li ... Lianshan Yan
International Journal of Intelligent Systems | VOL. 37
22 Sep 2022

M2F-Net: A Deep Learning-Based Multimodal Classification with High-Throughput Phenotyping for Identification of Overabundance of Fertilizers
J Dhakshayani ... B Surendiran
Agriculture | VOL. 13
13 Jun 2023
