In recent years, deep learning algorithms have rapidly revolutionized artificial intelligence, and machine learning in particular, allowing researchers and practitioners to move beyond hand-crafted feature extraction procedures. Deep learning relies on adaptive learning processes to extract increasingly complex and informative patterns from datasets of varying sizes. With the growing availability of multimodal data streams and recent advances in deep learning algorithms, multimodal deep learning is on the rise, requiring models that can process and analyze multimodal information in a consistent manner. Unstructured data, however, comes in many different forms (also known as modalities), and extracting relevant features from such data remains a challenging goal for deep learning researchers. According to the literature, most deep learning systems rely on a single architecture (i.e., standalone deep learning). When two or more deep learning architectures are combined across multiple sensory modalities, the result is called a multimodal hybrid deep learning model. Since this research direction has received considerable attention in the field of deep learning, the purpose of this survey is to provide a broad overview of the topic. In this paper, we provide a comprehensive review of recent advances in multimodal hybrid deep learning, including a thorough analysis of the most commonly developed hybrid architectures. One of the main challenges in multimodal hybrid analysis is the ability of these architectures to systematically integrate cross-modal features. We therefore propose a generic framework for multimodal hybrid learning that focuses primarily on fusion methods. We also identify trends and challenges in multimodal hybrid learning and provide insights and directions for future research. Our findings show that multimodal hybrid learning performs well across a variety of challenging computer vision applications and tasks.
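To make the fusion idea concrete, the following is a minimal sketch, not taken from the surveyed works, of concatenation-based late fusion, in which two modality-specific branches are combined into a single prediction head. The `LateFusionModel` class, the encoder dimensions, and the toy inputs are illustrative assumptions rather than the framework proposed in the paper.

```python
# Minimal sketch (illustrative only): concatenation-based late fusion of two
# modality-specific encoders, a common pattern in multimodal hybrid models.
# The feature dimensions and toy input shapes below are assumptions.
import torch
import torch.nn as nn


class LateFusionModel(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, hidden_dim=256, num_classes=10):
        super().__init__()
        # Modality-specific branches (stand-ins for, e.g., a CNN image encoder
        # and a text encoder producing fixed-size embeddings).
        self.image_branch = nn.Sequential(nn.Linear(image_dim, hidden_dim), nn.ReLU())
        self.text_branch = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        # Fusion head: concatenate the per-modality features, then classify.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, image_feats, text_feats):
        fused = torch.cat(
            [self.image_branch(image_feats), self.text_branch(text_feats)], dim=-1
        )
        return self.classifier(fused)


# Toy usage with random tensors standing in for pre-extracted modality embeddings.
model = LateFusionModel()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 10])
```

Concatenation is only one of several fusion strategies discussed in the survey; attention-based or intermediate fusion would replace the simple `torch.cat` step with a learned cross-modal interaction.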