Data efficient deep multimodal learning
Multimodal learning, which enables neural networks to process and integrate information from various sensory modalities such as vision, language, and sound, has become increasingly important in applications ranging from affective computing and healthcare to advanced multimodal chatbots. Despite its potential, multimodal learning faces significant challenges, particularly in the area of data efficiency. The requirement for large, high-quality datasets from multiple modalities presents a substantial barrier, limiting the scalability and accessibility of large multimodal models. This dissertation addresses several key issues in data-efficient deep multimodal learning, focusing on imbalanced multimodal data selection, the cold-start problem in multimodal active learning, and the mitigation of hallucinations in large vision-language models. First, we analyze the limitations of conventional active learning strategies, which tend to favor dominant modalities, leading to unbalanced multimodal models that neglect weaker modalities. To overcome this, we propose a gradient embedding modulation method that ensures a more equitable data selection process across modalities, resulting in models that fairly utilize both weak and strong modalities. Building on our work in warm-start active learning, we tackle the cold-start problem in multimodal active learning, where no initial labels are available for warm-start data selection. We develop a two-stage approach that first reduces the modality representation gap through multimodal self-supervised learning, utilizing unimodal prototypes to harmonize representations across modalities. In the subsequent data selection stage, we introduce a regularization term to maximize modality alignment, leading to improved model performance using the same amount of data compared to existing methods.
Extending our focus from data selection to the usage of training data, we address the challenge of hallucinations in large vision-language models, where the models generate content that is incorrect in the context of input images. We investigate the relationship between hallucinations and the visual dependence of tokens, revealing that certain tokens contribute disproportionately to these hallucinations. Based on this insight, we propose an approach that adjusts training weights according to the visual dependence of tokens, thereby reducing the hallucination rate without requiring additional training data or inference costs. The contributions of this thesis offer significant advancements in the field of data-efficient multimodal learning. By developing novel methods for balancing multimodal data selection, addressing the cold-start problem in multimodal active learning, and mitigating hallucinations in large vision-language models, this work paves the way for more practical and scalable multimodal learning systems that require less data and computational effort while achieving superior performance.
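The idea of adjusting training weights by the visual dependence of tokens can be illustrated with a small sketch. The abstract does not say how visual dependence is measured or in which direction the weights move, so everything here is an assumption: `visual_dependence` is a hypothetical per-token score in [0, 1], tokens with higher scores receive larger weights, and `floor` is an illustrative minimum weight. It is a toy loss function, not the dissertation's method.

```python
import math


def weighted_token_nll(token_log_probs, visual_dependence, floor=0.1):
    """Weighted average negative log-likelihood over the target tokens.

    token_log_probs: log-probability the model assigns to each target token.
    visual_dependence: hypothetical score in [0, 1] per token (assumption).
    floor: illustrative minimum weight so that no token is ignored entirely.
    """
    if len(token_log_probs) != len(visual_dependence):
        raise ValueError("one visual-dependence score is needed per token")
    weights = [max(floor, d) for d in visual_dependence]
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one token must have a positive weight")
    return -sum(w * lp for w, lp in zip(weights, token_log_probs)) / total


# Two tokens; the second is assumed to depend more strongly on the image.
log_probs = [math.log(0.5), math.log(0.25)]
uniform_loss = weighted_token_nll(log_probs, [1.0, 1.0])
reweighted_loss = weighted_token_nll(log_probs, [0.2, 1.0])
```

Because only the loss weights change, a scheme of this kind needs no extra training data and leaves inference untouched, which matches the cost claim in the abstract.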
- Research Article
4
- 10.1016/j.jdent.2023.104588
- Jun 21, 2023
- Journal of Dentistry
Multi-modal deep learning for automated assembly of periapical radiographs
- Research Article
- 10.55041/ijsrem52491
- Sep 9, 2025
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
Abstract: Data-driven healthcare offers unprecedented opportunities to improve diagnosis, prognosis, and personalized treatment by means of multi-modal learning. This paper discusses the integration of electronic health records (EHR), medical images, and genomic data through multi-modal deep learning. By combining the complementary strengths of heterogeneous data sources, multi-modal models can capture richer feature representations and more complex patterns than are visible with unimodal processing. We propose an end-to-end protocol to align, preprocess, and fuse modalities, and demonstrate deep neural networks that learn jointly from structured EHR data, high-dimensional imaging features, and gene expression data. Experiments show that the proposed model outperforms single-modality counterparts on disease classification and patient stratification. The paper highlights the need for data alignment, imputation of missing modalities, and modality-specific representation learning to fully exploit multi-modal data in the clinical context. Keywords: Multi-modal Learning, Electronic Health Records (EHR), Medical Imaging, Genomic Data, Deep Learning, Data Fusion, Healthcare AI, Precision Medicine, Patient Stratification, Biomedical Informatics.
- Research Article
60
- 10.1016/j.imavis.2025.105509
- May 1, 2025
- Image and Vision Computing
Deep learning has revolutionized biomedical research by providing sophisticated methods to handle complex, high-dimensional data. Multimodal deep learning (MDL) further enhances this capability by integrating diverse data types such as imaging, textual data, and genetic information, leading to more robust and accurate predictive models. In MDL, differently from early and late fusion methods, intermediate fusion stands out for its ability to effectively combine modality-specific features during the learning process. This systematic review comprehensively analyzes and formalizes current intermediate fusion methods in biomedical applications, highlighting their effectiveness in improving predictive performance and capturing complex inter-modal relationships. We investigate the techniques employed, the challenges faced, and potential future directions for advancing intermediate fusion methods. Additionally, we introduce a novel structured notation that standardizes intermediate fusion architectures, enhancing understanding and facilitating implementation across various domains. Our findings provide actionable insights and practical guidelines intended to support researchers, healthcare professionals, and the broader deep learning community in developing more sophisticated and insightful multimodal models. Through this review, we aim to provide a foundational framework for future research and practical applications in the dynamic field of MDL. • Comprehensive review of intermediate fusion in multimodal learning in biomedicine. • Structured notation for categorizing intermediate fusion methods. • Analysis of the benefits and challenges of intermediate fusion in biomedical contexts. • Identification of future research directions for improving current fusion techniques. • Versatile framework applicable to other multimodal deep learning domains.
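The distinction between early, intermediate, and late fusion drawn in the abstract above comes down to where the modalities are combined. The sketch below shows the three placements with toy stand-ins; the encoder, head, and input names are hypothetical, and real systems would use trained networks rather than averages.

```python
def mean(values):
    """Toy stand-in for a model or prediction head: average of its inputs."""
    return sum(values) / len(values)


def encode_image(values):
    """Hypothetical image encoder: two summary features."""
    return [mean(values), max(values)]


def encode_text(values):
    """Hypothetical text encoder: one summary feature."""
    return [mean(values)]


def early_fusion(x_a, x_b, model):
    """Concatenate the raw inputs, then run a single joint model."""
    return model(x_a + x_b)


def intermediate_fusion(x_a, x_b, enc_a, enc_b, head):
    """Encode each modality separately, fuse the features, then predict."""
    return head(enc_a(x_a) + enc_b(x_b))


def late_fusion(x_a, x_b, model_a, model_b):
    """Predict per modality, then average the two predictions."""
    return 0.5 * (model_a(x_a) + model_b(x_b))


image = [0.2, 0.4, 0.6]
text = [1.0, 3.0]

early = early_fusion(image, text, mean)
intermediate = intermediate_fusion(image, text, encode_image, encode_text, mean)
late = late_fusion(image, text, mean, mean)
```

In the intermediate case the fused vector consists of learned, modality-specific features, which is the property the review singles out.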
- Research Article
403
- 10.1007/s00371-021-02166-7
- Jun 10, 2021
- The Visual Computer
The research progress in multimodal learning has grown rapidly over the last decade in several areas, especially in computer vision. The growing potential of multimodal data streams and deep learning algorithms has contributed to the increasing universality of deep multimodal learning. This involves the development of models capable of processing and analyzing the multimodal information uniformly. Unstructured real-world data can inherently take many forms, also known as modalities, often including visual and textual content. Extracting relevant patterns from this kind of data is still a motivating goal for researchers in deep learning. In this paper, we seek to improve the understanding of key concepts and algorithms of deep multimodal learning for the computer vision community by exploring how to generate deep models that consider the integration and combination of heterogeneous visual cues across sensory modalities. In particular, we summarize six perspectives from the current literature on deep multimodal learning, namely: multimodal data representation, multimodal fusion (i.e., both traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and zero-shot learning. We also survey current multimodal applications and present a collection of benchmark datasets for solving problems in various vision domains. Finally, we highlight the limitations and challenges of deep multimodal learning and provide insights and directions for future research.
- Research Article
163
- 10.1145/3545572
- Feb 17, 2023
- ACM Transactions on Multimedia Computing, Communications, and Applications
Deep learning has been applied to a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities. Despite the extensive development made for unimodal learning, it still cannot cover all the aspects of human learning. Multimodal learning helps to understand and analyze better when various senses are engaged in the processing of information. This article focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, physiological signals, flow, RGB, pose, depth, mesh, and point cloud. A detailed analysis of the baseline approaches and an in-depth study of recent advancements during the past five years (2017 to 2021) in multimodal deep learning applications are provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth. Last, the main issues are highlighted separately for each domain, along with their possible future research directions.
- Supplementary Content
28
- 10.1093/genetics/iyae161
- Nov 5, 2024
- Genetics
Deep learning methods have been applied when working to enhance the prediction accuracy of traditional statistical methods in the field of plant breeding. Although deep learning seems to be a promising approach for genomic prediction, it has proven to have some limitations, since its conventional methods fail to leverage all available information. Multimodal deep learning methods aim to improve the predictive power of their unimodal counterparts by introducing several modalities (sources) of input information. In this review, we introduce some theoretical basic concepts of multimodal deep learning and provide a list of the most widely used neural network architectures in deep learning, as well as the available strategies to fuse data from different modalities. We mention some of the available computational resources for the practical implementation of multimodal deep learning problems. We finally performed a review of applications of multimodal deep learning to genomic selection in plant breeding and other related fields. We present a meta-picture of the practical performance of multimodal deep learning methods to highlight how these tools can help address complex problems in the field of plant breeding. We discussed some relevant considerations that researchers should keep in mind when applying multimodal deep learning methods. Multimodal deep learning holds significant potential for various fields, including genomic selection. While multimodal deep learning displays enhanced prediction capabilities over unimodal deep learning and other machine learning methods, it demands more computational resources. Multimodal deep learning effectively captures intermodal interactions, especially when integrating data from different sources. To apply multimodal deep learning in genomic selection, suitable architectures and fusion strategies must be chosen. It is relevant to keep in mind that multimodal deep learning, like unimodal deep learning, is a powerful tool but should be carefully applied.
Given its predictive edge over traditional methods, multimodal deep learning is valuable in addressing challenges in plant breeding and food security amid a growing global population.
- Research Article
28
- 10.18034/ajhal.v4i2.658
- Dec 31, 2017
- Asian Journal of Humanity, Art and Literature
A modality is an event or experience. Life is multimodal: we see, hear, smell, feel, and taste. Multimodal experiences involve several of these modalities. Artificial intelligence must grasp multimodal views to understand our surroundings. Multimodal machine learning models interact with and correlate input from several modalities. It is a multi-disciplinary field with great potential. In this study, we analyze emerging multimodal machine learning technologies and categorize them scientifically rather than focusing on specific multimodal applications. Multimodal machine learning offers more potential and problems than classifications. Most multimodal learning research collects quantitative data from polls and surveys. This research reviews a detailed library of observational studies on multimodal data (MMD) skills for human learning using artificial intelligence-powered approaches, including machine learning and deep learning. This research also describes how MMD has improved learning and in what environments. This paper discusses multimodal learning, its ongoing improvements, and approaches to improving learning. Finally, future researchers should carefully consider building a system that aligns multimodal aspects with the study and learning plan. These elements could enhance multimodal learning by facilitating theory and practice activities. This research lays the groundwork for multimodal data use in future learning technologies and development.
- Preprint Article
- 10.5194/egusphere-egu23-5818
- May 15, 2023
In general, water level prediction models using deep learning techniques have been developed using time-series water level observation data from upstream water level stations and target water level stations, even though many kinds of physical data are necessary to predict water level. Changes in the water level are greatly affected by rainfall in the basin; therefore, rainfall information is needed to predict the water level more accurately. In particular, radar data has the advantage of being able to directly acquire the amount of rainfall occurring within a watershed. This study aims to develop a multimodal deep learning model to predict the water level using 2D grid radar rainfall data and 1D time-series water level observation data. This study proposes two multimodal deep learning models with different structures. Both multimodal deep learning models predict the water level by simultaneously using the observed water level data up to the present time and the radar rainfall data that affects the water level in the future. The first proposed model consists of a deep learning network that links 2D Average Pooling (AvgPool2D), which compresses 2D radar data to 1D data, and Long Short-Term Memory (LSTM), which predicts 1D time-series water level data. The second proposed model consists of a deep learning network that predicts water levels by linking Conv2DLSTM and LSTM, which can reflect the characteristics of 2D radar data without deformation. The two proposed multimodal deep learning models were trained and evaluated in the upper basin of the Hantan River. In addition, they were compared with the results of a single-modal LSTM using only water level data. There are three water level stations in the study area, and the objective was to predict the water level of the downstream station up to 180 minutes in advance. For training and verification of the deep learning models, 10-minute water level and radar rainfall data were collected from May 2019 to October 2021.
For the radar data used as input, the grid data included in the target watershed were extracted from composite radar data with a resolution of 1 km operated by the Ministry of Environment. In the evaluation of the trained deep learning models, the two multimodal models had higher prediction accuracy than the single-modal model using only water level data. In particular, the second proposed model (Conv2DLSTM+LSTM) had better predictive performance than the first proposed model (AvgPool2D+LSTM) at the time of a sudden rise in water level due to rainfall. Acknowledgments: Research for this paper was carried out under the KICT Research Program (project no. 202200175-001, Development of future-leading technologies solving water crisis against to water disasters affected by climate change) funded by the Ministry of Science and ICT.
- Dissertation
- 10.32657/10356/182226
- Jan 1, 2024
In the past few years, multimodal learning has made significant progress. The goal of multimodal learning is to create models that can relate and process data from various modalities. One of the challenges is to learn useful representations efficiently given the heterogeneity of the data. Another is how to fuse the information from two or more modalities to perform a prediction, which is robust against possibly missing modalities. To reduce these research gaps, this dissertation attempts to develop effective and efficient network modules for both unimodal learning and crossmodal fusion. It also aims to improve the robustness of the fused features for different downstream tasks. In multimodal representation learning, both complementary crossmodal representation fusion and effective unimodal representation are crucial. Some prior works try to modulate one modal feature to another directly. Although it can be effective in aligning the multimodal features, it will ignore both unimodal and crossmodal representation refinements, which is important for multimodal fusion. In this dissertation, we introduce the Unimodal and Crossmodal Refinement Network (UCRN) to enhance both unimodal and crossmodal representations in multimodal learning. We propose a unimodal refinement module that iteratively updates modality-specific representations using transformer-based attention layers, followed by self-quality improvement layers. These refined unimodal representations are then projected into a common latent space and further tuned using a crossmodal refinement module. The results in multiple benchmark datasets show improved performance and robustness against missing modalities and noisy data in multimodal sequence fusion scenarios. Besides representation refinement for better fusion performance, it is also important to reduce the overfitting issue during learning. 
As the predictive powers of the modalities differ, the existing modality gap can lead to overfitting and undermine the fusion performance. This dissertation aims to improve unimodal and crossmodal representations with the proposed regularized expressive representation distillation (RERD) approach. To improve crossmodal optimization and minimize modality gaps before fusion, a multimodal Sinkhorn distance regularizer is introduced, and multi-head distillation encoders with iterative updates are used to refine unimodal representations. We evaluate the proposed method on a range of benchmark datasets. The results show that RERD performs better than current baselines, proving to be an effective method for deep multimodal fusion on sequence datasets. To further improve the robustness of multimodal representations against noisy inputs, we study robustness in the context of multimodal contrastive learning (MCL), as contrastive learning is effective at discriminating coexisting semantic features (positive) from irrelevant ones (negative) in multimodal signals. To address this weakness in MCL, this dissertation presents Pace-adaptive and Noise-resistant Noise-Contrastive Estimation (PN-NCE) as a novel self-supervised method for multimodal fusion. We propose to adaptively optimize the similarity between positive and negative pairs and improve robustness against noisy inputs during training. By integrating an estimator to measure modality invariance, PN-NCE achieves consistent performance improvements across various multimodal tasks and datasets and comparable results with supervised learning approaches. To gain more insight into effective and reliable multimodal learning in practical applications, we examine the proposed method on audio-visual deception detection in videos. Deception detection in conversations is a challenging yet important task, having pivotal applications in various fields.
The first challenge is the scarcity of high-quality datasets in deception detection research. In this dissertation, we introduce a large gameshow deception detection dataset, DOLOS, with rich multimodal annotations. DOLOS comprises 1,675 video clips with audio-visual annotations featuring 213 subjects. We benchmark deception detection approaches on the DOLOS dataset. Additionally, we propose Parameter-Efficient Crossmodal Learning (PECL), which comprises a Uniform Temporal Adapter and a Plug-in Audio-Visual Fusion module, to enhance performance with fewer parameters and exploit multi-task learning for improved deception detection performance. The Uniform Temporal Adapter module differs from the modules in UCRN and RERD because it is lightweight and plug-and-play. In summary, this dissertation focuses on efficient and robust multimodal learning and fusion. To achieve these goals, different methods and modules are proposed to enhance the performance of fused features for downstream tasks. Experimental results on different benchmark datasets and real-world applications show the effectiveness of the proposed methods compared with state-of-the-art approaches.
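For readers unfamiliar with the contrastive objective that the abstract above builds on, the following is a minimal sketch of the standard InfoNCE loss for one positive pair and a set of negatives. It is the common baseline formulation, not PN-NCE; the pace-adaptive and noise-resistant components are not described in enough detail here to reproduce. The similarity values and the temperature are illustrative assumptions.

```python
import math


def info_nce(pos_sim, neg_sims, temperature=0.1):
    """Standard InfoNCE loss for one anchor.

    pos_sim: similarity between the anchor and its positive (e.g. cosine).
    neg_sims: similarities between the anchor and each negative.
    Returns -log( exp(pos/t) / (exp(pos/t) + sum_k exp(neg_k/t)) ).
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    shift = max(logits)  # log-sum-exp shift for numerical stability
    log_denominator = shift + math.log(sum(math.exp(l - shift) for l in logits))
    return log_denominator - logits[0]


# A well-aligned audio-visual pair gives a lower loss than a poorly aligned one.
aligned = info_nce(0.9, [0.1, -0.2, 0.0])
misaligned = info_nce(0.1, [0.9, -0.2, 0.0])
```

The loss falls as the positive pair becomes more similar than the negatives, which is the discrimination of coexisting from irrelevant features that the abstract refers to.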
- Research Article
1
- 10.1158/1538-7445.am2024-2313
- Mar 22, 2024
- Cancer Research
Purpose: Cancer patients routinely undergo radiologic and pathologic evaluation for their diagnostic workup. These data modalities represent a valuable and readily available resource for developing new prognostic tools. Given their vast difference in spatial scales, effective methods to integrate the two modalities are currently lacking. Here, we aim to develop a multi-modal approach to integrate radiology and pathology images for predicting outcomes in cancer patients. Methods: We propose a multi-modal weakly-supervised deep learning framework to integrate radiology and pathology images for survival prediction. We first extract multi-scale features from whole-slide H&E-stained pathology images to characterize cellular and tissue phenotypes as well as spatial cellular organization. We then build a hierarchical co-attention transformer to effectively learn the multi-modal interactions between radiology and pathology image features. Finally, a multimodal risk score is derived by combining complementary information from the two image modalities and clinical data for predicting outcome. We evaluate our approach in lung, gastric, and brain cancers with matched radiology and pathology images and clinical data available, each with separate training and external validation cohorts. Results: The multi-modal deep learning models achieved a reasonably high accuracy for predicting survival outcomes in the external validation cohorts (C-index range: 0.72-0.75 across three cancer types). The multi-modal prognostic models significantly improved upon single-modal approaches based on radiology or pathology images or clinical data alone (C-index range: 0.53-0.71, P<0.01). The multi-modal deep learning models were significantly associated with disease-free survival and overall survival (hazard ratio range: 3.23-4.46, P<0.0001).
In multivariable analyses, the models remained an independent prognostic factor (P<0.01) after adjusting for clinicopathological variables including cancer stage and tumor differentiation. Conclusions: The proposed multi-modal deep learning approach outperforms traditional methods for predicting survival outcomes by leveraging routinely available radiology and pathology images. With further independent validation, this may afford a promising approach to improve risk stratification and better inform treatment strategies for cancer patients. Citation Format: Zhe Li, Yuming Jiang, Ruijiang Li. Multi-modal deep learning to predict cancer outcomes by integrating radiology and pathology images [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 2313.
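The C-index reported in the abstract above measures how often a model ranks patients correctly by risk. The sketch below computes Harrell's concordance index on a tiny hypothetical cohort; the times, event flags, and risk scores are made up for illustration and are unrelated to the study's data.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index.

    times: follow-up time per patient.
    events: 1 if the event was observed, 0 if the patient was censored.
    risks: predicted risk score (higher means earlier expected event).
    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's follow-up time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical cohort: months of follow-up, event flags, predicted risks.
times = [5, 12, 20, 30]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]
c_index = concordance_index(times, events, risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported ranges of 0.72-0.75 (multi-modal) and 0.53-0.71 (single-modal).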
- Research Article
123
- 10.1016/j.inffus.2023.102217
- Dec 30, 2023
- Information Fusion
A survey of multimodal hybrid deep learning for computer vision: Architectures, applications, trends, and challenges
- Research Article
7
- 10.1038/s41598-025-10512-1
- Jul 12, 2025
- Scientific Reports
This paper presents an approach for adaptive scheduling and robustness optimization in global logistics networks by integrating multimodal deep reinforcement learning with Internet of Things (IoT) technologies. We propose an integrated framework comprising a multimodal data fusion mechanism that synthesizes heterogeneous IoT sensor data, historical records, and contextual information; an adaptive deep reinforcement learning architecture that generates dynamic scheduling policies; and a multi-objective robust optimization method that balances operational efficiency with system resilience. The framework addresses key challenges in global logistics including demand volatility, transportation disruptions, and environmental uncertainties. Comprehensive experiments conducted on real-world logistics datasets demonstrate that our approach outperforms traditional methods with an 18.7% reduction in operational costs, 12.4% improvement in service levels, and significantly enhanced robustness under various disruption scenarios. The proposed method maintains 83% performance stability during complex disruptions compared to 51–72% for alternative approaches, while keeping computational requirements feasible for practical deployment. This research demonstrates potential contributions to AI-driven logistics operations management by showing improved supply chain performance through multimodal learning and robust optimization techniques.
- Research Article
2
- 10.1051/0004-6361/202553751
- Jun 1, 2025
- Astronomy & Astrophysics
Aims. This study is aimed at deriving the age and metallicity of galaxies by proposing a novel multi-modal deep learning framework. This multi-modal framework integrates spectral and photometric data, offering advantages in cases where spectra are incomplete or unavailable. Methods. We propose a multi-modal learning method for estimating the age and metallicity of galaxies (MMLforGalAM). This method uses two modalities: spectra and photometric images as training samples. Its architecture consists of four models: a spectral feature extraction model (ℳ1), a simulated spectral feature generation model (ℳ2), an image feature extraction model (ℳ3), and a multi-modal attention regression model (ℳ4). Specifically, ℳ1 extracts spectral features associated with age and metallicity from spectra observed by the Sloan Digital Sky Survey (SDSS). These features are then used as labels to train ℳ2, which generates simulated spectral features for photometric images to address the challenge of missing observed spectra for some images. Overall, ℳ1 and ℳ2 provide a transformation from photometric to spectral features, with the goal of constructing a spectral representation of data pairs (photometric and spectral features) for multi-modal learning. Once ℳ2 is trained, MMLforGalAM can then be applied to scenarios with only images, even in the absence of spectra. Then, ℳ3 processes SDSS photometric images to extract features related to age and metallicity. Finally, ℳ4 combines the simulated spectral features from ℳ2 with the extracted image features from ℳ3 to predict the age and metallicity of galaxies. Results. Trained on 36278 galaxies from SDSS, our model predicts the stellar age and metallicity, with a scatter of 1σ = 0.1506 dex for age and 1 σ = 0.1402 dex for metallicity. Compared to a single-modal model trained using only images, the multi-modal approach reduces the scatter by 27% for age and 15% for metallicity.
- Research Article
46
- 10.1016/j.asoc.2021.107788
- Aug 11, 2021
- Applied Soft Computing
Sentiment-influenced trading system based on multimodal deep reinforcement learning
- Research Article
6
- 10.13374/j.issn2095-9389.2019.03.21.003
- May 1, 2020
- SHILAP Revista de lepidopterología
“Big data” is always collected from different resources that have different data structures. With the rapid development of information technologies, today's valuable data resources are characteristically multimodal. As a result, building on classical machine learning strategies, multi-modal learning has become a valuable research topic, enabling computers to process and understand “big data”. The cognitive processes of humans involve perception through different sense organs. Signals from eyes, ears, the nose, and hands (tactile sense) constitute a person’s understanding of a specific scene or the world as a whole. It is reasonable to believe that multi-modal methods with a higher ability to process complex heterogeneous data can further promote the progress of information technologies. The concept of multimodality stemmed from psychology and pedagogy hundreds of years ago and has been popular in computer science during the past decade. In contrast to the concept of “media”, a “mode” is a more fine-grained concept that is associated with a typical data source or data form. The effective utilization of multi-modal data can help a computer understand a specific environment in a more holistic way. In this context, we first introduce the definition and main tasks of multi-modal learning. Based on this information, the mechanism and origin of multi-modal machine learning are then briefly introduced. Subsequently, statistical learning methods and deep learning methods for multi-modal tasks are comprehensively summarized. We also introduce the main styles of data fusion in multi-modal perception tasks, including feature representation, shared mapping, and co-training. Additionally, novel adversarial learning strategies for cross-modal matching or generation are reviewed. The main methods for multi-modal learning are outlined in this paper with a focus on future research issues in this field.