A Review on Methods and Applications in Multimodal Deep Learning

Abstract

Deep learning has been applied across a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information from various modalities. Despite the extensive progress made in unimodal learning, it still cannot cover all aspects of human learning. Multimodal learning supports better understanding and analysis when various senses are engaged in processing information. This article focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, physiological signals, flow, RGB, pose, depth, mesh, and point cloud. A detailed analysis of the baseline approaches and an in-depth study of recent advancements in multimodal deep learning applications during the past five years (2017 to 2021) are provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth. Finally, the main issues are highlighted separately for each domain, along with possible future research directions.

Similar Papers
  • Supplementary Content
  • Cite Count Icon 28
  • 10.1093/genetics/iyae161
A review of multimodal deep learning methods for genomic-enabled prediction in plant breeding
  • Nov 5, 2024
  • Genetics
  • Osval A Montesinos-López + 9 more

Deep learning methods have been applied when working to enhance the prediction accuracy of traditional statistical methods in the field of plant breeding. Although deep learning seems to be a promising approach for genomic prediction, it has proven to have some limitations, since its conventional methods fail to leverage all available information. Multimodal deep learning methods aim to improve the predictive power of their unimodal counterparts by introducing several modalities (sources) of input information. In this review, we introduce some theoretical basic concepts of multimodal deep learning and provide a list of the most widely used neural network architectures in deep learning, as well as the available strategies to fuse data from different modalities. We mention some of the available computational resources for the practical implementation of multimodal deep learning problems. We finally performed a review of applications of multimodal deep learning to genomic selection in plant breeding and other related fields. We present a meta-picture of the practical performance of multimodal deep learning methods to highlight how these tools can help address complex problems in the field of plant breeding. We discussed some relevant considerations that researchers should keep in mind when applying multimodal deep learning methods. Multimodal deep learning holds significant potential for various fields, including genomic selection. While multimodal deep learning displays enhanced prediction capabilities over unimodal deep learning and other machine learning methods, it demands more computational resources. Multimodal deep learning effectively captures intermodal interactions, especially when integrating data from different sources. To apply multimodal deep learning in genomic selection, suitable architectures and fusion strategies must be chosen. It is relevant to keep in mind that multimodal deep learning, like unimodal deep learning, is a powerful tool but should be carefully applied. Given its predictive edge over traditional methods, multimodal deep learning is valuable in addressing challenges in plant breeding and food security amid a growing global population.

  • Dissertation
  • 10.32657/10356/182346
Data efficient deep multimodal learning
  • Jan 1, 2025
  • Meng Shen

Multimodal learning, which enables neural networks to process and integrate information from various sensory modalities such as vision, language, and sound, has become increasingly important in applications ranging from affective computing and healthcare to advanced multimodal chatbots. Despite its potential, multimodal learning faces significant challenges, particularly in the area of data efficiency. The requirement for large, high-quality datasets from multiple modalities presents a substantial barrier, limiting the scalability and accessibility of large multimodal models. This dissertation addresses several key issues in data-efficient deep multimodal learning, focusing on imbalanced multimodal data selection, the cold-start problem in multimodal active learning, and the mitigation of hallucinations in large vision-language models. Firstly, we analyze the limitations of conventional active learning strategies, which tend to favor dominant modalities, leading to unbalanced multimodal models that neglect weaker modalities. To overcome this, we propose a gradient embedding modulation method that ensures a more equitable data selection process across modalities, resulting in models that fairly utilize both weak and strong modalities. Building on our work in warm-start active learning, we tackle the cold-start problem in multimodal active learning, where no initial labels are available for warm-start data selection. We develop a two-stage approach that first reduces the modality representation gap through multimodal self-supervised learning, utilizing unimodal prototypes to harmonize representations across modalities. In the subsequent data selection stage, we introduce a regularization term to maximize modality alignment, leading to improved model performance using the same amount of data compared to existing methods. Extending our focus from data selection to the usage of training data, we address the challenge of hallucinations in large vision-language models, where the models generate content that is incorrect in the context of input images. We investigate the relationship between hallucinations and the visual dependence of tokens, revealing that certain tokens contribute disproportionately to these hallucinations. Based on this insight, we propose an approach that adjusts training weights according to the visual dependence of tokens, thereby reducing the hallucination rate without requiring additional training data or inference costs. The contributions of this thesis offer significant advancements in the field of data-efficient multimodal learning. By developing novel methods for balancing multimodal data selection, addressing the cold-start problem in multimodal active learning, and mitigating hallucinations in large vision-language models, this work paves the way for more practical and scalable multimodal learning systems that require less data and computational effort while achieving superior performance.

  • Research Article
  • Cite Count Icon 4
  • 10.1016/j.jdent.2023.104588
Multi-modal deep learning for automated assembly of periapical radiographs
  • Jun 21, 2023
  • Journal of Dentistry
  • L Pfänder + 5 more

  • Research Article
  • Cite Count Icon 58
  • 10.1016/j.imavis.2025.105509
A systematic review of intermediate fusion in multimodal deep learning for biomedical applications
  • May 1, 2025
  • Image and Vision Computing
  • Valerio Guarrasi + 6 more

Deep learning has revolutionized biomedical research by providing sophisticated methods to handle complex, high-dimensional data. Multimodal deep learning (MDL) further enhances this capability by integrating diverse data types such as imaging, textual data, and genetic information, leading to more robust and accurate predictive models. In MDL, unlike early and late fusion methods, intermediate fusion stands out for its ability to effectively combine modality-specific features during the learning process. This systematic review comprehensively analyzes and formalizes current intermediate fusion methods in biomedical applications, highlighting their effectiveness in improving predictive performance and capturing complex inter-modal relationships. We investigate the techniques employed, the challenges faced, and potential future directions for advancing intermediate fusion methods. Additionally, we introduce a novel structured notation that standardizes intermediate fusion architectures, enhancing understanding and facilitating implementation across various domains. Our findings provide actionable insights and practical guidelines intended to support researchers, healthcare professionals, and the broader deep learning community in developing more sophisticated and insightful multimodal models. Through this review, we aim to provide a foundational framework for future research and practical applications in the dynamic field of MDL.
  • Comprehensive review of intermediate fusion in multimodal learning in biomedicine.
  • Structured notation for categorizing intermediate fusion methods.
  • Analysis of the benefits and challenges of intermediate fusion in biomedical contexts.
  • Identification of future research directions for improving current fusion techniques.
  • Versatile framework applicable to other multimodal deep learning domains.

  • Research Article
  • Cite Count Icon 3
  • 10.2174/0115748936289033240424071522
Multimodal Deep Learning for Cancer Survival Prediction: A Review
  • May 1, 2025
  • Current Bioinformatics
  • Ge Zhang + 6 more

Background: Cancer has emerged as the "leading killer" of human health. Survival prediction is a crucial branch of cancer prognosis. It aims to estimate patients' survival risk based on their disease conditions. Accurate and efficient survival prediction is vital in cancer patients' treatment and clinical management, preventing unnecessary suffering and conserving precious medical resources. Deep learning has been extensively applied in cancer diagnosis, prognosis, and treatment management. The decreasing cost of next-generation sequencing, continuous development of related databases, and in-depth research on multimodal deep learning have provided opportunities for establishing more functionally rich and accurate survival prediction models. Objective: The current area of cancer survival prediction still lacks a review of multimodal deep learning methods. Methods: We conducted a statistical analysis of the relevant research on multimodal deep learning for cancer survival prediction. We first filtered keywords from 6 known relevant papers. Then, we searched PubMed and Google Scholar for relevant publications from 2018 to 2022 using "Multimodal", "Deep Learning" and "Cancer Survival Prediction" as keywords. Next, we extended the search to related publications through backward and forward citation searches. Subsequently, we conducted a detailed analysis and review of these studies based on their datasets and methods. Results: We present a comprehensive systematic review of the multimodal deep learning research on cancer survival prediction from 2018 to 2022. Conclusion: Multimodal deep learning has demonstrated powerful data aggregation capabilities and excellent performance, greatly improving cancer survival prediction. It has made a significant positive impact on facilitating the advancement of automated cancer diagnosis and precision oncology.

  • Research Article
  • Cite Count Icon 400
  • 10.1007/s00371-021-02166-7
A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets
  • Jun 10, 2021
  • The Visual Computer
  • Khaled Bayoudh + 3 more

The research progress in multimodal learning has grown rapidly over the last decade in several areas, especially in computer vision. The growing potential of multimodal data streams and deep learning algorithms has contributed to the increasing universality of deep multimodal learning. This involves the development of models capable of processing and analyzing the multimodal information uniformly. Unstructured real-world data can inherently take many forms, also known as modalities, often including visual and textual content. Extracting relevant patterns from this kind of data is still a motivating goal for researchers in deep learning. In this paper, we seek to improve the understanding of key concepts and algorithms of deep multimodal learning for the computer vision community by exploring how to generate deep models that consider the integration and combination of heterogeneous visual cues across sensory modalities. In particular, we summarize six perspectives from the current literature on deep multimodal learning, namely: multimodal data representation, multimodal fusion (i.e., both traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and zero-shot learning. We also survey current multimodal applications and present a collection of benchmark datasets for solving problems in various vision domains. Finally, we highlight the limitations and challenges of deep multimodal learning and provide insights and directions for future research.

  • Research Article
  • 10.55041/ijsrem47033
Human Emotion Recognition Using Multi-modal Deep Learning: A Review of Methods, Datasets, and Challenges
  • May 6, 2025
  • INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
  • Ayushi Parmar

Abstract: Human emotion recognition plays a vital role in the field of affective computing and finds wide-ranging applications in healthcare, education, robotics, and human-computer interaction. Traditional unimodal approaches, based solely on facial expressions, speech, or physiological signals, often face limitations due to varying environmental conditions, individual differences, and signal noise. To address these challenges, the use of multi-modal deep learning has gained significant momentum, as it combines multiple data streams such as visual, auditory, textual, and physiological inputs to enhance the accuracy and robustness of emotion detection. This review paper presents a detailed examination of recent developments in the area of multi-modal deep learning for human emotion recognition. We explore various deep learning models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), and Transformer-based architectures, and their effectiveness in processing different modalities. In addition, we study various fusion strategies (early, late, and hybrid fusion) and their respective contributions towards improving recognition performance. The aim of this paper is to provide researchers and practitioners with valuable insights into the current landscape, ongoing challenges, and future opportunities in this rapidly growing domain. Keywords: Human Emotion Recognition, Multi-modal Deep Learning, Emotion Detection, Deep Learning Architectures, Multi-modal Fusion
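
The difference between the early and late fusion strategies named in this abstract can be illustrated with a minimal sketch. It is not taken from the reviewed paper: the function names, the modality names, and the toy numbers are all hypothetical, and late fusion is shown here as a simple weighted average of per-modality class probabilities, which is only one of several possible decision-level schemes. Hybrid fusion, which mixes both levels, is not shown.

```python
def early_fusion(features_by_modality):
    """Early (feature-level) fusion: concatenate per-modality feature
    vectors into one vector that a single downstream model would consume."""
    fused = []
    for name in sorted(features_by_modality):  # fixed order keeps the layout stable
        fused.extend(features_by_modality[name])
    return fused


def late_fusion(probs_by_modality, weights=None):
    """Late (decision-level) fusion: combine the class probabilities that
    separate per-modality models produced, here by a weighted average."""
    names = sorted(probs_by_modality)
    if weights is None:
        # Assumption: equal trust in every modality.
        weights = {name: 1.0 / len(names) for name in names}
    n_classes = len(probs_by_modality[names[0]])
    return [
        sum(weights[name] * probs_by_modality[name][c] for name in names)
        for c in range(n_classes)
    ]


# Hypothetical toy inputs: two modalities, two emotion classes.
features = {"audio": [0.1, 0.2], "video": [0.3]}
probs = {"audio": [0.6, 0.4], "video": [0.2, 0.8]}

fused_features = early_fusion(features)  # one vector of length 3
fused_probs = late_fusion(probs)         # one probability per class
```

In the early variant a single model sees all modalities at once, so it can learn cross-modal interactions but needs every modality to be present. In the late variant each modality keeps its own model, so a weak or missing modality can be down-weighted.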

  • Research Article
  • Cite Count Icon 19
  • 10.1016/j.neucom.2018.09.005
An effective hierarchical extreme learning machine based multimodal fusion framework
  • Sep 19, 2018
  • Neurocomputing
  • Fang Du + 4 more

  • Research Article
  • Cite Count Icon 12
  • 10.3390/app12157477
Recurrent Neural Network-Based Multimodal Deep Learning for Estimating Missing Values in Healthcare
  • Jul 26, 2022
  • Applied Sciences
  • Joo-Chang Kim + 1 more

This estimation method integrates input values that are redundantly collected from heterogeneous devices by selecting a representative value, and estimates missing values by using a multimodal RNN. Users access a heterogeneous healthcare platform mainly in a mobile environment. Users who pay relatively close attention to healthcare possess various types of healthcare devices and collect data through their mobile devices. The collected data may be duplicated depending on the types of these devices. This data duplication causes an ambiguity issue in that it is difficult to determine which of the multiple values should be taken as the user’s actual value. Accordingly, it is necessary to create a neural network structure that considers the data value at the time step preceding the current one. RNNs are appropriate for handling data with a time series characteristic. To train an RNN-based neural network, training data that have the same time step are required. Therefore, an RNN in which one variable becomes single-modal was designed for each training run. In the RNN, a cell is a gated recurrent unit (GRU) cell that provides sufficient accuracy in the small resource environment of mobile devices. The RNNs trained for the individual variables can each operate without additional training, even if the situation of the user’s mobile device changes. In a heterogeneous environment, missing values are generated by various types of errors, including errors caused by battery charge and discharge, sensor failure, equipment exchange, and near-field communication errors. The higher the missing value ratio, the greater the number of errors that are likely to occur. For this reason, to achieve a more stable heterogeneous health platform, missing values must be considered. In this study, a missing value was estimated by means of multimodal deep learning; that is, a multimodal deep learning method was designed as one neural network that connects each trained single-modal RNN using a fully connected network (FCN). Each RNN input value exerts mutual influence through the weights of the FCN, and thereby it is possible to estimate an output value even if any one of the input values is missing. In the evaluation of representative value selection, the most stable service was achieved when the representative value was selected by using the mean or median. In the evaluation of the estimation method, the accuracy of the RNN-based multimodal deep learning method was 3.91 percentage points higher than that of the SVD method.
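
The representative value selection step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `representative_value`, the use of `None` to mark a missing reading, and the sample heart-rate readings are all hypothetical. The sketch covers only the collapsing of duplicate readings with the mean or median; the GRU-based estimation of values that remain missing is not shown.

```python
from statistics import mean, median


def representative_value(readings, method="median"):
    """Collapse duplicate readings of one variable at one time step.

    `readings` holds the values reported by several devices; None marks a
    device that reported nothing. Returns None when every reading is
    missing, which is the case a downstream estimator would have to fill.
    """
    present = [r for r in readings if r is not None]
    if not present:
        return None
    if method == "median":
        return median(present)
    if method == "mean":
        return mean(present)
    raise ValueError(f"unknown method: {method}")


# Hypothetical example: three devices report heart rate, one reports nothing.
heart_rate = [72, 75, None, 90]
print(representative_value(heart_rate, "median"))  # 75
print(representative_value(heart_rate, "mean"))    # 79
```

The median is less sensitive to a single outlying device (the reading of 90 above) than the mean, which is one plausible reason either choice can give stable results when devices mostly agree.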

  • Research Article
  • Cite Count Icon 121
  • 10.1016/j.inffus.2023.102217
A survey of multimodal hybrid deep learning for computer vision: Architectures, applications, trends, and challenges
  • Dec 30, 2023
  • Information Fusion
  • Khaled Bayoudh

  • Research Article
  • 10.1142/s0219467826500191
3D Image Reconstruction Using Virtual Reality and Multimodal Deep Learning
  • Aug 22, 2024
  • International Journal of Image and Graphics
  • Yong Chen

Three-dimensional (3D) image reconstruction techniques have found extensive applications in fields such as medicine, education, and computer science, enabling high-precision 3D images to enhance work efficiency. However, traditional methods of 3D image reconstruction solely rely on acoustic information, resulting in limited accuracy. Therefore, a novel method based on virtual reality (VR) and multimodal deep learning is proposed for 3D image reconstruction. First, VR technology is employed to capture 3D image information, followed by de-noising and removal of redundant information. Second, a logarithmic transformation method is employed to enhance the details in the 3D image. Finally, a multimodal deep learning method is utilized to reconstruct the 3D image from the perspectives of imagery, sound, and video. Experimental results demonstrate that the proposed method achieves superior 3D image reconstruction with an accuracy of over 90%. The reconstruction process is efficient and exhibits low signal-to-noise ratio, while the average registration error is less than 0.04%. These findings highlight the practical value and potential applications of the proposed method.

  • Conference Article
  • Cite Count Icon 16
  • 10.1109/itsc.2018.8569659
Predicting Hazardous Driving Events Using Multi-Modal Deep Learning Based on Video Motion Profile and Kinematics Data
  • Nov 1, 2018
  • Z Gao + 5 more

With the rise in traffic accidents caused by commercial vehicle drivers, more regulations have been issued to improve their safety status. Driving record instruments are required to be installed on such vehicles in China. The resulting naturalistic driving data offer insight into the causal factors of hazardous events, provided that hazardous events can be located within large volumes of data. In this study, we develop a model based on a low-definition driving record instrument and vehicle kinematic data for post-accident analysis using a multi-modal deep learning method. Because the camera sits higher on commercial vehicles than on cars and can observe a greater distance, motion profiles are extracted from driving video to capture the trajectory features of front vehicles at different depths. Then random forest is used to select significant kinematic variables that can reflect a potential crash. Finally, a multi-modal deep convolutional neural network (DCNN) combining both video and kinematic data is developed to identify potential collision risk in each 12-second vehicle trip. The analysis results indicate that the proposed multi-modal deep learning model can identify hazardous events within large volumes of data at an AUC of 0.81, which outperforms the state-of-the-art random forest model and kinematic threshold method.

  • Research Article
  • Cite Count Icon 13
  • 10.3389/fpls.2023.1094142
Study on the detection of water status of tomato (Solanum lycopersicum L.) by multimodal deep learning
  • May 31, 2023
  • Frontiers in Plant Science
  • Zhiyu Zuo + 7 more

Water plays a very important role in the growth of tomato (Solanum lycopersicum L.), and detecting the water status of tomato is the key to precise irrigation. The objective of this study is to detect the water status of tomato by fusing RGB, NIR and depth image information through deep learning. Five irrigation levels were set to cultivate tomatoes in different water states, with irrigation amounts of 150%, 125%, 100%, 75%, and 50% of reference evapotranspiration calculated by a modified Penman-Monteith equation, respectively. The water status of tomatoes was divided into five categories: severely irrigated deficit, slightly irrigated deficit, moderately irrigated, slightly over-irrigated, and severely over-irrigated. RGB images, depth images and NIR images of the upper part of the tomato plant were taken as data sets. The data sets were used to train and test the tomato water status detection models built with single-mode and multimodal deep learning networks, respectively. In the single-mode deep learning network, two CNNs, VGG-16 and ResNet-50, were trained on a single RGB image, a depth image, or a NIR image for a total of six cases. In the multimodal deep learning network, two or more of the RGB images, depth images and NIR images were trained with VGG-16 or ResNet-50, respectively, for a total of 20 combinations. Results showed that the accuracy of tomato water status detection based on single-mode deep learning ranged from 88.97% to 93.09%, while the accuracy of tomato water status detection based on multimodal deep learning ranged from 93.09% to 99.18%. The multimodal deep learning significantly outperformed the single-mode deep learning. The tomato water status detection model built using a multimodal deep learning network with ResNet-50 for RGB images and VGG-16 for depth and NIR images was optimal. This study provides a novel method for non-destructive detection of water status of tomato and gives a reference for precise irrigation management.
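
The counts of six single-mode cases and 20 multimodal combinations reported in this abstract can be reproduced by enumeration. The sketch below assumes that each modality in a combination is independently assigned either VGG-16 or ResNet-50, which matches the reported optimal model (ResNet-50 for RGB, VGG-16 for depth and NIR). The function name `enumerate_configs` is hypothetical.

```python
from itertools import combinations, product

MODALITIES = ("RGB", "depth", "NIR")
BACKBONES = ("VGG-16", "ResNet-50")


def enumerate_configs(min_modalities, max_modalities):
    """List every way to pick a subset of modalities of the given sizes and
    assign one backbone to each picked modality (assumed independent)."""
    configs = []
    for k in range(min_modalities, max_modalities + 1):
        for subset in combinations(MODALITIES, k):
            for backbones in product(BACKBONES, repeat=k):
                configs.append(tuple(zip(subset, backbones)))
    return configs


single = enumerate_configs(1, 1)  # 3 modalities x 2 backbones
multi = enumerate_configs(2, 3)   # 3 pairs x 2^2 + 1 triple x 2^3
print(len(single), len(multi))    # 6 20
```

The pairs contribute 3 × 4 = 12 configurations and the full triple contributes 8, which together give the 20 combinations stated above.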

  • Research Article
  • 10.55041/ijsrem52491
Multi-Modal Learning Approaches Combining EHR, Imaging, and Genomic Data
  • Sep 9, 2025
  • INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
  • Veerendra Nath Jasthi

Abstract: Data-driven healthcare has created unprecedented opportunities to improve diagnosis, prediction, and personalized treatment by means of multi-modal learning. The present paper discusses the combination of electronic health records (EHR), medical images, and genomic data through multi-modal deep learning. By drawing on heterogeneous data sources and combining their complementary strengths, multi-modal models can capture richer feature representations and more complex patterns that are not visible with unimodal processing. We propose an end-to-end protocol to align, preprocess, and fuse modalities, and demonstrate an application in which deep neural networks learn jointly from structured EHR data, high-dimensional imaging features, and gene expression data. Experiments show that the proposed model performs better on disease classification and patient stratification than its single-modality counterparts. The paper highlights the need for data alignment, imputation of missing modalities, and modality-specific representation learning in order to make full use of multi-modal data in the clinical context. Keywords: Multi-modal Learning, Electronic Health Records (EHR), Medical Imaging, Genomic Data, Deep Learning, Data Fusion, Healthcare AI, Precision Medicine, Patient Stratification, Biomedical Informatics.

  • Research Article
  • Cite Count Icon 7
  • 10.1038/s41598-025-10512-1
A multimodal deep reinforcement learning approach for IoT-driven adaptive scheduling and robustness optimization in global logistics networks
  • Jul 12, 2025
  • Scientific Reports
  • Yao Lu

This paper presents an approach for adaptive scheduling and robustness optimization in global logistics networks by integrating multimodal deep reinforcement learning with Internet of Things (IoT) technologies. We propose an integrated framework comprising a multimodal data fusion mechanism that synthesizes heterogeneous IoT sensor data, historical records, and contextual information; an adaptive deep reinforcement learning architecture that generates dynamic scheduling policies; and a multi-objective robust optimization method that balances operational efficiency with system resilience. The framework addresses key challenges in global logistics including demand volatility, transportation disruptions, and environmental uncertainties. Comprehensive experiments conducted on real-world logistics datasets demonstrate that our approach outperforms traditional methods with an 18.7% reduction in operational costs, 12.4% improvement in service levels, and significantly enhanced robustness under various disruption scenarios. The proposed method maintains 83% performance stability during complex disruptions compared to 51–72% for alternative approaches, while keeping computational requirements feasible for practical deployment. This research demonstrates potential contributions to AI-driven logistics operations management by showing improved supply chain performance through multimodal learning and robust optimization techniques.
