Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

Chao Zhang, Zichao Yang, Li Deng, Xiaodong He

doi:10.1109/jstsp.2020.2987728

Abstract

Deep learning methods have revolutionized speech recognition, image recognition, and natural language processing since 2010. Each of these tasks involves a single modality in its input signals. However, many applications in the artificial intelligence field involve multiple modalities. Therefore, it is of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, we provide a technical review of available models and learning methods for multimodal intelligence. The main focus of this review is the combination of vision and natural language modalities, which has become an important topic in both the computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent works on multimodal deep learning from three perspectives: learning multimodal representations, fusing multimodal signals at various levels, and multimodal applications. Regarding multimodal representation learning, we review the key concepts of embedding, which unify multimodal signals into a single vector space and thereby enable cross-modality signal processing. We also review the properties of many types of embeddings that are constructed and learned for general downstream tasks. Regarding multimodal fusion, this review focuses on special architectures for the integration of representations of unimodal signals for a particular task. Regarding applications, selected areas of broad interest in the current literature are covered, including image-to-text caption generation, text-to-image generation, and visual question answering. We believe that this review will facilitate future studies in the emerging field of multimodal intelligence for related communities.

Journal: IEEE Journal of Selected Topics in Signal Processing
Publication Date: Mar 1, 2020
Citations: 459

Similar Papers

Effectiveness Analysis of Entrepreneurial Legal Risk Prevention Based on Multi-Modal Deep Learning Model
Tianhua Li ... Shaowei Qu
22 Jun 2024
ACM Transactions on Asian and Low-Resource Language Information Processing | VOL. 23

Multimodal deep representation learning for protein interaction identification and protein family classification
Da Zhang ... Mansur Kabuka
01 Dec 2019
BMC Bioinformatics | VOL. 20

A review of multimodal deep learning methods for genomic-enabled prediction in plant breeding.
Osval A Montesinos-López ... José Crossa
05 Nov 2024
Genetics | VOL. -

Deep Multimodal Representation Learning: A Survey
Wenzhong Guo ... Jianwen Wang
01 Jan 2019
IEEE Access | VOL. 7
