Introduction to the Special Issue on Deep Learning for Multi-Modal Intelligence Across Speech, Language, Vision, and Heterogeneous Signals

Xiaodong He, Li Deng, Chao Zhang, Isabel Trancoso, Minlie Huang, Richard Rose

doi:10.1109/jstsp.2020.2989852

Abstract

The ten papers included in this special section focus on deep learning for multi-modal intelligence across speech, language, vision, and heterogeneous signals. Thanks to disruptive advances in deep learning, significant progress has been made in artificial intelligence (AI) applications involving a single modality, such as speech recognition, speech synthesis, image classification, object detection, machine translation, and reading comprehension. However, many AI problems require more than one modality, and techniques developed for different modalities can often be successfully cross-fertilized. Studies of modeling and learning approaches across multiple modalities are therefore of great interest. This special issue brings together a diverse but complementary set of contributions on emerging deep learning methods for problems based on multiple modalities, including speech, text, image, and video.

Full Text