A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

Abstract

The research progress in multimodal learning has grown rapidly over the last decade in several areas, especially in computer vision. The growing potential of multimodal data streams and deep learning algorithms has contributed to the increasing universality of deep multimodal learning. This involves the development of models capable of processing and analyzing the multimodal information uniformly. Unstructured real-world data can inherently take many forms, also known as modalities, often including visual and textual content. Extracting relevant patterns from this kind of data is still a motivating goal for researchers in deep learning. In this paper, we seek to improve the understanding of key concepts and algorithms of deep multimodal learning for the computer vision community by exploring how to generate deep models that consider the integration and combination of heterogeneous visual cues across sensory modalities. In particular, we summarize six perspectives from the current literature on deep multimodal learning, namely: multimodal data representation, multimodal fusion (i.e., both traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and zero-shot learning. We also survey current multimodal applications and present a collection of benchmark datasets for solving problems in various vision domains. Finally, we highlight the limitations and challenges of deep multimodal learning and provide insights and directions for future research.

Similar Papers
  • Research Article
  • Citations: 121
  • 10.1016/j.inffus.2023.102217
A survey of multimodal hybrid deep learning for computer vision: Architectures, applications, trends, and challenges
  • Dec 30, 2023
  • Information Fusion
  • Khaled Bayoudh


  • Dissertation
  • 10.32657/10356/182346
Data efficient deep multimodal learning
  • Jan 1, 2025
  • Meng Shen

Multimodal learning, which enables neural networks to process and integrate information from various sensory modalities such as vision, language, and sound, has become increasingly important in applications ranging from affective computing and healthcare to advanced multimodal chatbots. Despite its potential, multimodal learning faces significant challenges, particularly in the area of data efficiency. The requirement for large, high-quality datasets from multiple modalities presents a substantial barrier, limiting the scalability and accessibility of large multimodal models. This dissertation addresses several key issues in data-efficient deep multimodal learning, focusing on imbalanced multimodal data selection, the cold-start problem in multimodal active learning, and the mitigation of hallucinations in large vision-language models. Firstly, we analyze the limitations of conventional active learning strategies, which tend to favor dominant modalities, leading to unbalanced multimodal models that neglect weaker modalities. To overcome this, we propose a gradient embedding modulation method that ensures a more equitable data selection process across modalities, resulting in models that fairly utilize both weak and strong modalities. Building on our work in warm-start active learning, we tackle the cold-start problem in multimodal active learning, where no initial labels are available for warm-start data selection. We develop a two-stage approach that first reduces the modality representation gap through multimodal self-supervised learning, utilizing unimodal prototypes to harmonize representations across modalities. In the subsequent data selection stage, we introduce a regularization term to maximize modality alignment, leading to improved model performance using the same amount of data compared to existing methods.
Extending our focus from data selection to the usage of training data, we address the challenge of hallucinations in large vision-language models, where the models generate content that is incorrect in the context of input images. We investigate the relationship between hallucinations and the visual dependence of tokens, revealing that certain tokens contribute disproportionately to these hallucinations. Based on this insight, we propose an approach that adjusts training weights according to the visual dependence of tokens, thereby reducing the hallucination rate without requiring additional training data or inference costs. The contributions of this thesis offer significant advancements in the field of data-efficient multimodal learning. By developing novel methods for balancing multimodal data selection, addressing the cold-start problem in multimodal active learning, and mitigating hallucinations in large vision-language models, this work paves the way for more practical and scalable multimodal learning systems that require less data and computational effort while achieving superior performance.

  • Supplementary Content
  • Citations: 28
  • 10.1093/genetics/iyae161
A review of multimodal deep learning methods for genomic-enabled prediction in plant breeding
  • Nov 5, 2024
  • Genetics
  • Osval A Montesinos-López + 9 more

Deep learning methods have been applied when working to enhance the prediction accuracy of traditional statistical methods in the field of plant breeding. Although deep learning seems to be a promising approach for genomic prediction, it has proven to have some limitations, since its conventional methods fail to leverage all available information. Multimodal deep learning methods aim to improve the predictive power of their unimodal counterparts by introducing several modalities (sources) of input information. In this review, we introduce some theoretical basic concepts of multimodal deep learning and provide a list of the most widely used neural network architectures in deep learning, as well as the available strategies to fuse data from different modalities. We mention some of the available computational resources for the practical implementation of multimodal deep learning problems. We finally performed a review of applications of multimodal deep learning to genomic selection in plant breeding and other related fields. We present a meta-picture of the practical performance of multimodal deep learning methods to highlight how these tools can help address complex problems in the field of plant breeding. We discussed some relevant considerations that researchers should keep in mind when applying multimodal deep learning methods. Multimodal deep learning holds significant potential for various fields, including genomic selection. While multimodal deep learning displays enhanced prediction capabilities over unimodal deep learning and other machine learning methods, it demands more computational resources. Multimodal deep learning effectively captures intermodal interactions, especially when integrating data from different sources. To apply multimodal deep learning in genomic selection, suitable architectures and fusion strategies must be chosen. It is relevant to keep in mind that multimodal deep learning, like unimodal deep learning, is a powerful tool but should be carefully applied. Given its predictive edge over traditional methods, multimodal deep learning is valuable in addressing challenges in plant breeding and food security amid a growing global population.

  • Research Article
  • Citations: 4
  • 10.1016/j.jdent.2023.104588
Multi-modal deep learning for automated assembly of periapical radiographs
  • Jun 21, 2023
  • Journal of Dentistry
  • L Pfänder + 5 more


  • Research Article
  • Citations: 53
  • 10.1016/j.ins.2022.12.014
Analysis of multimodal data fusion from an information theory perspective
  • Dec 9, 2022
  • Information Sciences
  • Yinglong Dai + 4 more


  • Research Article
  • Citations: 54
  • 10.1088/1361-6560/ac4c47
Deep multimodal learning for lymph node metastasis prediction of primary thyroid cancer
  • Feb 1, 2022
  • Physics in Medicine & Biology
  • Xinglong Wu + 3 more

Objective. The incidence of primary thyroid cancer has risen steadily over the past decades because of overdiagnosis and overtreatment through the improvement in imaging techniques for screening, especially in ultrasound examination. Metastatic status of lymph nodes is important for staging the type of primary thyroid cancer. Deep learning algorithms based on ultrasound images were thus developed to assist radiologists in the diagnosis of lymph node metastasis. The objective of this study is to integrate more clinical context (e.g., health records and various image modalities) into, and explore more interpretable patterns discovered by, deep learning algorithms for the prediction of lymph node metastasis in primary thyroid cancer patients. Approach. A deep multimodal learning network was developed in this study, with a novel index proposed to compare the contribution of different modalities when making the predictions. Main results. The proposed multimodal network achieved an average F1 score of 0.888 and an average area under the receiver operating characteristic curve (AUC) value of 0.973 in two independent validation sets, and the performance was significantly better than that of three single-modality deep learning networks. Moreover, among the three modalities used in this study, the deep multimodal learning network generally relied more on the image modalities than on the data modality of clinical records when making the predictions. Significance. Our work is beneficial to prospective clinical trials of radiologists on the diagnosis of lymph node metastasis in primary thyroid cancer, and will better help them understand how the predictions are made in deep multimodal learning algorithms.

  • Research Article
  • Citations: 58
  • 10.1016/j.imavis.2025.105509
A systematic review of intermediate fusion in multimodal deep learning for biomedical applications
  • May 1, 2025
  • Image and Vision Computing
  • Valerio Guarrasi + 6 more

Deep learning has revolutionized biomedical research by providing sophisticated methods to handle complex, high-dimensional data. Multimodal deep learning (MDL) further enhances this capability by integrating diverse data types such as imaging, textual data, and genetic information, leading to more robust and accurate predictive models. In MDL, unlike early and late fusion methods, intermediate fusion stands out for its ability to effectively combine modality-specific features during the learning process. This systematic review comprehensively analyzes and formalizes current intermediate fusion methods in biomedical applications, highlighting their effectiveness in improving predictive performance and capturing complex inter-modal relationships. We investigate the techniques employed, the challenges faced, and potential future directions for advancing intermediate fusion methods. Additionally, we introduce a novel structured notation that standardizes intermediate fusion architectures, enhancing understanding and facilitating implementation across various domains. Our findings provide actionable insights and practical guidelines intended to support researchers, healthcare professionals, and the broader deep learning community in developing more sophisticated and insightful multimodal models. Through this review, we aim to provide a foundational framework for future research and practical applications in the dynamic field of MDL.

  • Comprehensive review of intermediate fusion in multimodal learning in biomedicine.
  • Structured notation for categorizing intermediate fusion methods.
  • Analysis of the benefits and challenges of intermediate fusion in biomedical contexts.
  • Identification of future research directions for improving current fusion techniques.
  • Versatile framework applicable to other multimodal deep learning domains.

  • Research Article
  • Citations: 7
  • 10.1038/s41598-025-10512-1
A multimodal deep reinforcement learning approach for IoT-driven adaptive scheduling and robustness optimization in global logistics networks
  • Jul 12, 2025
  • Scientific Reports
  • Yao Lu

This paper presents an approach for adaptive scheduling and robustness optimization in global logistics networks by integrating multimodal deep reinforcement learning with Internet of Things (IoT) technologies. We propose an integrated framework comprising a multimodal data fusion mechanism that synthesizes heterogeneous IoT sensor data, historical records, and contextual information; an adaptive deep reinforcement learning architecture that generates dynamic scheduling policies; and a multi-objective robust optimization method that balances operational efficiency with system resilience. The framework addresses key challenges in global logistics including demand volatility, transportation disruptions, and environmental uncertainties. Comprehensive experiments conducted on real-world logistics datasets demonstrate that our approach outperforms traditional methods with an 18.7% reduction in operational costs, 12.4% improvement in service levels, and significantly enhanced robustness under various disruption scenarios. The proposed method maintains 83% performance stability during complex disruptions compared to 51–72% for alternative approaches, while keeping computational requirements feasible for practical deployment. This research demonstrates potential contributions to AI-driven logistics operations management by showing improved supply chain performance through multimodal learning and robust optimization techniques.

  • Research Article
  • Citations: 3
  • 10.1016/j.acra.2024.12.018
Multimodal Deep Learning Fusing Clinical and Radiomics Scores for Prediction of Early-Stage Lung Adenocarcinoma Lymph Node Metastasis.
  • May 1, 2025
  • Academic Radiology
  • Chengcheng Xia + 8 more


  • Conference Article
  • 10.1109/icerect56837.2022.10060086
Multi-Modal Colour Extraction Using Deep Learning Techniques
  • Dec 26, 2022
  • Karthik Kulkarni + 2 more

Multimodal learning research has advanced quickly over the past ten years in a variety of fields, especially computer vision. Owing to the growing opportunities offered by multimodal streaming data and deep learning algorithms, deep multimodal learning is increasingly common. This calls for the development of models that can reliably handle and interpret multimodal data. Unstructured real-world data can naturally take many different forms, known as modalities, often including both text and images. Deep learning researchers are still driven by the need to extract useful patterns from this type of data. Well-organized product catalogues are crucial for enhancing customers' experiences as they explore the plethora of possibilities provided by online marketplaces. The availability of product characteristics such as colour or material is a crucial component of this. However, attribute data is frequently erroneous or absent on several of the markets we focus on. One potential approach to solving this issue is to use deep models trained on huge corpora to predict features from unstructured data, such as product descriptions and photographs (referred to as modalities in this study). This paper gives a comprehensive rundown of the various multi-modal colour extraction techniques and their advantages, drawbacks and open challenges.

  • Research Article
  • Citations: 17
  • 10.3389/frai.2023.1247195
Multimodal deep learning for liver cancer applications: a scoping review.
  • Oct 27, 2023
  • Frontiers in Artificial Intelligence
  • Aisha Siam + 5 more

Hepatocellular carcinoma is a malignant neoplasm of the liver and a leading cause of cancer-related deaths worldwide. Multimodal data combines several modalities, such as medical images, clinical parameters, and electronic health record (EHR) reports, from diverse sources to accomplish the diagnosis of liver cancer. The introduction of deep learning models with multimodal data can enhance the diagnosis and improve physicians' decision-making for cancer patients. This scoping review explores the use of multimodal deep learning techniques (i.e., combining medical images and EHR data) in the diagnosis and prognosis of hepatocellular carcinoma (HCC) and cholangiocarcinoma (CCA). A comprehensive literature search was conducted in six databases, along with forward and backward reference list checking of the included studies. The PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) extension for scoping reviews guidelines were followed for the study selection process. The data was extracted and synthesized from the included studies through thematic analysis. Ten studies were included in this review. These studies utilized multimodal deep learning to predict and diagnose hepatocellular carcinoma (HCC), but no studies examined cholangiocarcinoma (CCA). Four imaging modalities (CT, MRI, WSI, and DSA) and 51 unique EHR records (clinical parameters and biomarkers) were used in these studies. The most frequently used medical imaging modalities were CT scans followed by MRI, whereas the most common EHR parameters used were age, gender, alpha-fetoprotein (AFP), albumin, coagulation factors, and bilirubin. Ten unique deep-learning techniques were applied to both EHR modalities and imaging modalities for two main purposes, prediction and diagnosis. The use of multimodal data and deep learning techniques can help in the diagnosis and prediction of HCC. However, there is a limited number of works and available datasets for liver cancer, thus limiting the overall advancements of AI for liver cancer applications. Hence, more research should be undertaken to explore further the potential of multimodal deep learning in liver cancer applications.

  • Research Article
  • Citations: 160
  • 10.1145/3545572
A Review on Methods and Applications in Multimodal Deep Learning
  • Feb 17, 2023
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Summaira Jabeen + 5 more

Deep learning has been applied to a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities. Despite the extensive development made for unimodal learning, it still cannot cover all the aspects of human learning. Multimodal learning helps to understand and analyze better when various senses are engaged in the processing of information. This article focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, physiological signals, flow, RGB, pose, depth, mesh, and point cloud. A detailed analysis of the baseline approaches and an in-depth study of recent advancements during the past five years (2017 to 2021) in multimodal deep learning applications are provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth. Last, main issues are highlighted separately for each domain, along with their possible future research directions.

  • Conference Article
  • Citations: 44
  • 10.1109/icmlc48188.2019.8949228
Multimodal Representation Learning: Advances, Trends and Challenges
  • Jul 1, 2019
  • Su-Fang Zhang + 4 more

Representation learning is a crucial foundation for subsequent tasks such as classification, regression, and recognition. The goal of representation learning is to automatically learn good features with deep models. Multimodal representation learning is a special case of representation learning that automatically learns good features from multiple modalities. These modalities are not independent: there are correlations and associations among them. Furthermore, multimodal data are usually heterogeneous. Because of these characteristics, multimodal representation learning poses many difficulties: how to combine multimodal data from heterogeneous sources, how to jointly learn features from multimodal data, how to effectively describe the correlations and associations, and so on. These difficulties have attracted great interest from researchers and, along with the upsurge of deep learning, many deep multimodal learning methods have been proposed. In this paper, we present an overview of deep multimodal learning, especially the approaches proposed within the last decades. We provide potential readers with advances, trends and challenges, which can be very helpful to researchers in the field of machine learning, especially those engaged in the study of multimodal deep learning.

  • Research Article
  • Citations: 3
  • 10.2174/0115748936289033240424071522
Multimodal Deep Learning for Cancer Survival Prediction: A Review
  • May 1, 2025
  • Current Bioinformatics
  • Ge Zhang + 6 more

Background: Cancer has emerged as the "leading killer" threatening human health. Survival prediction is a crucial branch of cancer prognosis. It aims to estimate patients' survival risk based on their disease conditions. Accurate and efficient survival prediction is vital in cancer patients' treatment and clinical management, preventing unnecessary suffering and conserving precious medical resources. Deep learning has been extensively applied in cancer diagnosis, prognosis, and treatment management. The decreasing cost of next-generation sequencing, continuous development of related databases, and in-depth research on multimodal deep learning have provided opportunities for establishing more functionally rich and accurate survival prediction models. Objective: The current area of cancer survival prediction still lacks a review of multimodal deep learning methods. Methods: We conducted a statistical analysis of the relevant research on multimodal deep learning for cancer survival prediction. We first filtered keywords from 6 known relevant papers. Then, we searched PubMed and Google Scholar for relevant publications from 2018 to 2022 using "Multimodal", "Deep Learning" and "Cancer Survival Prediction" as keywords. Next, we searched for further related publications through backward and forward citation search. Subsequently, we conducted a detailed analysis and review of these studies based on their datasets and methods. Results: We present a comprehensive systematic review of the multimodal deep learning research on cancer survival prediction from 2018 to 2022. Conclusion: Multimodal deep learning has demonstrated powerful data aggregation capabilities and excellent performance in greatly improving cancer survival prediction. It has made a significant positive impact on facilitating the advancement of automated cancer diagnosis and precision oncology.

  • Research Article
  • Citations: 735
  • 10.1162/neco_a_01273
A Survey on Deep Learning for Multimodal Data Fusion.
  • May 1, 2020
  • Neural Computation
  • Jing Gao + 3 more

With the wide deployment of heterogeneous networks, huge amounts of data with characteristics of high volume, high variety, high velocity, and high veracity are generated. These data, referred to as multimodal big data, contain abundant intermodality and cross-modality information and pose vast challenges to traditional data fusion methods. In this review, we present some pioneering deep learning models to fuse these multimodal big data. With the increasing exploration of multimodal big data, there are still some challenges to be addressed. Thus, this review presents a survey on deep learning for multimodal data fusion to provide readers, regardless of their original community, with the fundamentals of multimodal deep learning fusion methods and to motivate new deep learning techniques for multimodal data fusion. Specifically, representative architectures that are widely used are summarized as fundamental to the understanding of multimodal deep learning. Then the current pioneering multimodal data fusion deep learning models are summarized. Finally, some challenges and future topics of multimodal data fusion deep learning models are described.
