A review of multimodal deep learning methods for genomic-enabled prediction in plant breeding
Deep learning methods have been applied to enhance the prediction accuracy of traditional statistical methods in the field of plant breeding. Although deep learning seems to be a promising approach for genomic prediction, it has proven to have some limitations, since its conventional methods fail to leverage all available information. Multimodal deep learning methods aim to improve the predictive power of their unimodal counterparts by introducing several modalities (sources) of input information. In this review, we introduce some basic theoretical concepts of multimodal deep learning and provide a list of the most widely used neural network architectures in deep learning, as well as the available strategies to fuse data from different modalities. We mention some of the available computational resources for the practical implementation of multimodal deep learning problems. Finally, we review applications of multimodal deep learning to genomic selection in plant breeding and other related fields. We present a meta-picture of the practical performance of multimodal deep learning methods to highlight how these tools can help address complex problems in the field of plant breeding. We discuss some relevant considerations that researchers should keep in mind when applying multimodal deep learning methods. Multimodal deep learning holds significant potential for various fields, including genomic selection. While multimodal deep learning displays enhanced prediction capabilities over unimodal deep learning and other machine learning methods, it demands more computational resources. Multimodal deep learning effectively captures intermodal interactions, especially when integrating data from different sources. To apply multimodal deep learning in genomic selection, suitable architectures and fusion strategies must be chosen. It is relevant to keep in mind that multimodal deep learning, like unimodal deep learning, is a powerful tool but should be carefully applied. Given its predictive edge over traditional methods, multimodal deep learning is valuable in addressing challenges in plant breeding and food security amid a growing global population.
- Research Article
160
- 10.1145/3545572
- Feb 17, 2023
- ACM Transactions on Multimedia Computing, Communications, and Applications
Deep learning has been applied in a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities. Despite the extensive development made for unimodal learning, it still cannot cover all the aspects of human learning. Multimodal learning helps to understand and analyze better when various senses are engaged in the processing of information. This article focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, physiological signals, flow, RGB, pose, depth, mesh, and point cloud. A detailed analysis of the baseline approaches and an in-depth study of recent advancements during the past five years (2017 to 2021) in multimodal deep learning applications are provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth. Last, main issues are highlighted separately for each domain, along with their possible future research directions.
- Research Article
3
- 10.2174/0115748936289033240424071522
- May 1, 2025
- Current Bioinformatics
Background: Cancer has emerged as the "leading killer" of human health. Survival prediction is a crucial branch of cancer prognosis. It aims to estimate patients' survival risk based on their disease conditions. Accurate and efficient survival prediction is vital in cancer patients' treatment and clinical management, preventing unnecessary suffering and conserving precious medical resources. Deep learning has been extensively applied in cancer diagnosis, prognosis, and treatment management. The decreasing cost of next-generation sequencing, continuous development of related databases, and in-depth research on multimodal deep learning have provided opportunities for establishing more functionally rich and accurate survival prediction models. Objective: The current area of cancer survival prediction still lacks a review of multimodal deep learning methods. Methods: We conducted a statistical analysis of the relevant research on multimodal deep learning for cancer survival prediction. We first filtered keywords from 6 known relevant papers. Then, we searched PubMed and Google Scholar for relevant publications from 2018 to 2022 using "Multimodal", "Deep Learning" and "Cancer Survival Prediction" as keywords. Then, we further searched the related publications through the backward and forward citation search. Subsequently, we conducted a detailed analysis and review of these studies based on their datasets and methods. Results: We present a comprehensive systematic review of the multimodal deep learning research on cancer survival prediction from 2018 to 2022. Conclusion: Multimodal deep learning has demonstrated powerful data aggregation capabilities and excellent performance in improving cancer survival prediction greatly. It has made a significant positive impact on facilitating the advancement of automated cancer diagnosis and precision oncology.
- Research Article
6
- 10.3390/app15010360
- Jan 2, 2025
- Applied Sciences
In recent years, deep learning has witnessed astonishing success in the field of remote sensing images. Generally, deep learning requires a large amount of labeled training data. Nevertheless, in remote sensing, sufficient labeled data are scarce because labeled data are often difficult, expensive, or time-consuming to obtain. To address these problems, we propose a deep curriculum learning semi-supervised framework (DCLSSF) for remote sensing image scene classification. This framework employs a multimodal deep curriculum learning method which can classify images across a range from easy to difficult. Specifically, by utilizing multiple pretrained networks to extract multiple deep features of images as their multimodal feature representations, it can comprehensively mine the information from labeled and unlabeled images from diverse perspectives. Subsequently, a feature fusion method is used on deep features of different modalities to obtain deep fusion features with a strong discrimination ability and low dimensionality. Finally, the multimodal deep features are fed into multimodal curriculum learning methods for classification. Multimodal curriculum learning can integrate the easy curricula recommended by each modality according to the order of the samples of each modality and then learn step by step. Experiments on three publicly available datasets (UC Merced, AID, and NWPU-RESISC45) show that the semi-supervised classification framework achieves high accuracy rates (99.14%, 97.95%, and 93.01%), even surpassing those of most supervised classification methods. The DCLSSF method can not only fully exploit the rich features extracted by the multimodal deep learning network but can also perform the semi-supervised classification of unlabeled samples across a range from easy to difficult.
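The easy-to-hard ordering described in this abstract can be illustrated with a small sketch. It is not the DCLSSF algorithm: it assumes each modality supplies one difficulty score per sample and that the per-modality orderings are combined by summing ranks. The function names (`ranks`, `curriculum_order`, `curriculum_stages`) and the rank-sum rule are hypothetical choices made for illustration only.

```python
# Hypothetical sketch: combine per-modality "easiness" orderings into one
# curriculum. The rank-sum rule is an assumption, not the DCLSSF method.

def ranks(scores):
    """Map each sample index to its rank (0 = easiest) under one modality."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank = [0] * len(scores)
    for position, idx in enumerate(order):
        rank[idx] = position
    return rank


def curriculum_order(difficulty_by_modality):
    """Order sample indices from easy to hard by summing per-modality ranks."""
    n = len(difficulty_by_modality[0])
    per_modality = [ranks(scores) for scores in difficulty_by_modality]
    total = [sum(r[i] for r in per_modality) for i in range(n)]
    # Ties are broken by sample index so the result is deterministic.
    return sorted(range(n), key=lambda i: (total[i], i))


def curriculum_stages(order, stage_size):
    """Yield growing training pools: easiest samples first, harder ones later."""
    for end in range(stage_size, len(order) + stage_size, stage_size):
        yield order[:min(end, len(order))]


if __name__ == "__main__":
    # Two modalities scoring the same four samples (lower = easier).
    modality_a = [0.9, 0.1, 0.5, 0.3]
    modality_b = [0.8, 0.2, 0.1, 0.7]
    order = curriculum_order([modality_a, modality_b])
    for stage, pool in enumerate(curriculum_stages(order, stage_size=2), 1):
        print(f"stage {stage}: train on samples {pool}")
```

A learner would be retrained on each successive pool, so that samples all modalities consider easy are seen before the contested or hard ones.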
- Research Article
400
- 10.1007/s00371-021-02166-7
- Jun 10, 2021
- The Visual Computer
The research progress in multimodal learning has grown rapidly over the last decade in several areas, especially in computer vision. The growing potential of multimodal data streams and deep learning algorithms has contributed to the increasing universality of deep multimodal learning. This involves the development of models capable of processing and analyzing the multimodal information uniformly. Unstructured real-world data can inherently take many forms, also known as modalities, often including visual and textual content. Extracting relevant patterns from this kind of data is still a motivating goal for researchers in deep learning. In this paper, we seek to improve the understanding of key concepts and algorithms of deep multimodal learning for the computer vision community by exploring how to generate deep models that consider the integration and combination of heterogeneous visual cues across sensory modalities. In particular, we summarize six perspectives from the current literature on deep multimodal learning, namely: multimodal data representation, multimodal fusion (i.e., both traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and zero-shot learning. We also survey current multimodal applications and present a collection of benchmark datasets for solving problems in various vision domains. Finally, we highlight the limitations and challenges of deep multimodal learning and provide insights and directions for future research.
- Research Article
4
- 10.1016/j.jdent.2023.104588
- Jun 21, 2023
- Journal of Dentistry
Multi-modal deep learning for automated assembly of periapical radiographs
- Research Article
121
- 10.1016/j.inffus.2023.102217
- Dec 30, 2023
- Information Fusion
A survey of multimodal hybrid deep learning for computer vision: Architectures, applications, trends, and challenges
- Research Article
19
- 10.1016/j.neucom.2018.09.005
- Sep 19, 2018
- Neurocomputing
An effective hierarchical extreme learning machine based multimodal fusion framework
- Conference Article
44
- 10.1109/icmlc48188.2019.8949228
- Jul 1, 2019
Representation learning is the basis of, and crucial for, subsequent tasks such as classification, regression, and recognition. The goal of representation learning is to automatically learn good features with deep models. Multimodal representation learning is a special case of representation learning that automatically learns good features from multiple modalities. These modalities are not independent: there are correlations and associations among them. Furthermore, multimodal data are usually heterogeneous. Due to these characteristics, multimodal representation learning poses many difficulties: how to combine multimodal data from heterogeneous sources; how to jointly learn features from multimodal data; how to effectively describe the correlations and associations, etc. These difficulties have triggered great interest among researchers, and, along with the upsurge of deep learning, many deep multimodal learning methods have been proposed. In this paper, we present an overview of deep multimodal learning, especially the approaches proposed within the last decades. We provide potential readers with advances, trends and challenges, which can be very helpful to researchers in the field of machine learning, especially those engaged in the study of multimodal deep machine learning.
- Dissertation
- 10.32657/10356/182346
- Jan 1, 2025
Multimodal learning, which enables neural networks to process and integrate information from various sensory modalities such as vision, language, and sound, has become increasingly important in applications ranging from affective computing and healthcare to advanced multimodal chatbots. Despite its potential, multimodal learning faces significant challenges, particularly in the area of data efficiency. The requirement for large, high-quality datasets from multiple modalities presents a substantial barrier, limiting the scalability and accessibility of large multimodal models. This dissertation addresses several key issues in data-efficient deep multimodal learning, focusing on the imbalanced multimodal data selection, the cold-start problem in multimodal active learning, and the mitigation of hallucinations in large vision-language models. Firstly, we analyze the limitations of conventional active learning strategies, which tend to favor dominant modalities, leading to unbalanced multimodal models that neglect weaker modalities. To overcome this, we propose a gradient embedding modulation method that ensures a more equitable data selection process across modalities, resulting in models that fairly utilize both weak and strong modalities. Building on our work in warm-start active learning, we tackle the cold-start problem in multimodal active learning, where no initial labels are available for warm-start data selection. We develop a two-stage approach that first reduces the modality representation gap through multimodal self-supervised learning, utilizing unimodal prototypes to harmonize representations across modalities. In the subsequent data selection stage, we introduce a regularization term to maximize modality alignment, leading to improved model performance using the same amount of data compared to existing methods.
Extending our focus from data selection to the usage of training data, we address the challenge of hallucinations in large vision-language models, where the models generate content that is incorrect in the context of input images. We investigate the relationship between hallucinations and visual dependence of tokens, revealing that certain tokens contribute disproportionately to these hallucinations. Based on this insight, we propose an approach that adjusts training weights according to the visual dependence of tokens, thereby reducing the hallucination rate without requiring additional training data or inference costs. The contributions of this thesis offer significant advancements in the field of data-efficient multimodal learning. By developing novel methods for balancing multimodal data selection, addressing the cold-start problem in multimodal active learning, and mitigating hallucinations in large vision-language models, this work paves the way for more practical and scalable multimodal learning systems that require less data and computational effort while achieving superior performance.
- Research Article
58
- 10.1016/j.imavis.2025.105509
- May 1, 2025
- Image and Vision Computing
Deep learning has revolutionized biomedical research by providing sophisticated methods to handle complex, high-dimensional data. Multimodal deep learning (MDL) further enhances this capability by integrating diverse data types such as imaging, textual data, and genetic information, leading to more robust and accurate predictive models. In MDL, unlike early and late fusion methods, intermediate fusion stands out for its ability to effectively combine modality-specific features during the learning process. This systematic review comprehensively analyzes and formalizes current intermediate fusion methods in biomedical applications, highlighting their effectiveness in improving predictive performance and capturing complex inter-modal relationships. We investigate the techniques employed, the challenges faced, and potential future directions for advancing intermediate fusion methods. Additionally, we introduce a novel structured notation that standardizes intermediate fusion architectures, enhancing understanding and facilitating implementation across various domains. Our findings provide actionable insights and practical guidelines intended to support researchers, healthcare professionals, and the broader deep learning community in developing more sophisticated and insightful multimodal models. Through this review, we aim to provide a foundational framework for future research and practical applications in the dynamic field of MDL.
• Comprehensive review of intermediate fusion in multimodal learning in biomedicine.
• Structured notation for categorizing intermediate fusion methods.
• Analysis of the benefits and challenges of intermediate fusion in biomedical contexts.
• Identification of future research directions for improving current fusion techniques.
• Versatile framework applicable to other multimodal deep learning domains.
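The difference between early, late, and intermediate fusion mentioned in this abstract can be made concrete with a toy sketch. It assumes two modalities given as plain feature lists, fixed hand-picked weights instead of trained ones, and single linear units in place of real networks. All function names, variable names, and weights are hypothetical and do not come from the reviewed paper.

```python
# Toy contrast of three fusion points. Weights are fixed by hand here; in a
# real model they would be learned. Everything named below is illustrative.

def linear(x, weights, bias=0.0):
    """Weighted sum of a feature vector (a single linear unit)."""
    return sum(w * v for w, v in zip(weights, x)) + bias


def relu(value):
    """Rectified linear activation."""
    return max(0.0, value)


def early_fusion(image_feats, gene_feats, weights):
    """Concatenate the raw inputs, then apply one shared model."""
    return linear(image_feats + gene_feats, weights)


def late_fusion(image_feats, gene_feats, w_image, w_gene):
    """Predict separately per modality, then average the predictions."""
    return 0.5 * (linear(image_feats, w_image) + linear(gene_feats, w_gene))


def intermediate_fusion(image_feats, gene_feats, enc_image, enc_gene, head):
    """Encode each modality on its own, concatenate the hidden features,
    and let a shared head combine them."""
    hidden_image = [relu(linear(image_feats, w)) for w in enc_image]
    hidden_gene = [relu(linear(gene_feats, w)) for w in enc_gene]
    return linear(hidden_image + hidden_gene, head)


if __name__ == "__main__":
    image = [1.0, 2.0]   # stand-in for imaging features
    genes = [0.5]        # stand-in for genetic features
    print(early_fusion(image, genes, [0.1, 0.2, 0.4]))
    print(late_fusion(image, genes, [0.1, 0.2], [0.4]))
    print(intermediate_fusion(image, genes,
                              enc_image=[[1.0, -1.0], [0.5, 0.5]],
                              enc_gene=[[2.0]],
                              head=[1.0, 2.0, 3.0]))
```

Only the intermediate variant lets the shared head see modality-specific hidden features. That is the property the abstract attributes to intermediate fusion.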
- Book Chapter
3
- 10.2174/9789815305128124010008
- Oct 10, 2024
Machine learning algorithms have been important in identifying and predicting cardiovascular risk. These algorithms use a variety of data sources, including patient histories, clinical measures, and electronic health records, to identify people at risk of cardiovascular problems. Methods of deep learning, a subset of machine learning, hold the promise of enhancing the accuracy and effectiveness of cardiovascular risk prediction models. In this research, retinal images, clinical data, and various clinical features are employed to harness the capabilities of multimodal deep learning for predicting cardiovascular risk. The integration of these modalities enables a holistic assessment of an individual's cardiovascular health, contributing to the advancement of precision medicine in the realm of Cardiovascular Disease (CVD). The impact of this research extends beyond cardiovascular risk prediction, as it exemplifies the transformative potential of machine learning in healthcare. By applying cutting-edge technology to medical challenges, our work addresses the urgent need for early risk assessment, patient stratification, and personalized interventions. This showcases how the synergy of different data types and deep learning can lead to improved clinical decision support, reduced healthcare costs, and, ultimately, enhanced patient outcomes. Deploying such multimodal deep learning models in clinical practice has the potential to revolutionize the field of cardiovascular health and set a precedent for the broader role of machine learning in healthcare.
- Research Article
1063
- 10.1109/msp.2017.2738401
- Nov 1, 2017
- IEEE Signal Processing Magazine
The success of deep learning has been a catalyst to solving increasingly complex machine-learning problems, which often involve multiple data modalities. We review recent advances in deep multimodal learning and highlight the state of the art, as well as gaps and challenges in this active research field. We first classify deep multimodal learning architectures and then discuss methods to fuse learned multimodal representations in deep-learning architectures. We highlight two areas of research (regularization strategies and methods that learn or optimize multimodal fusion structures) as exciting areas for future work.
- Research Article
7
- 10.1038/s41598-025-10512-1
- Jul 12, 2025
- Scientific Reports
This paper presents an approach for adaptive scheduling and robustness optimization in global logistics networks by integrating multimodal deep reinforcement learning with Internet of Things (IoT) technologies. We propose an integrated framework comprising a multimodal data fusion mechanism that synthesizes heterogeneous IoT sensor data, historical records, and contextual information; an adaptive deep reinforcement learning architecture that generates dynamic scheduling policies; and a multi-objective robust optimization method that balances operational efficiency with system resilience. The framework addresses key challenges in global logistics including demand volatility, transportation disruptions, and environmental uncertainties. Comprehensive experiments conducted on real-world logistics datasets demonstrate that our approach outperforms traditional methods with an 18.7% reduction in operational costs, 12.4% improvement in service levels, and significantly enhanced robustness under various disruption scenarios. The proposed method maintains 83% performance stability during complex disruptions compared to 51–72% for alternative approaches, while keeping computational requirements feasible for practical deployment. This research demonstrates potential contributions to AI-driven logistics operations management by showing improved supply chain performance through multimodal learning and robust optimization techniques.
- Research Article
311
- 10.1186/s12864-020-07319-x
- Jan 6, 2021
- BMC Genomics
Background: Several conventional genomic Bayesian (or non-Bayesian) prediction methods have been proposed, including the standard additive genetic effect model for which the variance components are estimated with mixed model equations. In recent years, deep learning (DL) methods have been considered in the context of genomic prediction. The DL methods are nonparametric models providing flexibility to adapt to complicated associations between data and output with the ability to adapt to very complex patterns. Main body: We review the applications of DL methods in genomic selection (GS) to obtain a meta-picture of GS performance and highlight how these tools can help solve challenging plant breeding problems. We also provide general guidance for the effective use of DL methods including the fundamentals of DL and the requirements for its appropriate use. We discuss the pros and cons of this technique compared to traditional genomic prediction approaches as well as the current trends in DL applications. Conclusions: The main requirement for using DL is high-quality and sufficiently large training data. Based on the current literature on GS in plant and animal breeding, we did not find clear superiority of DL in terms of prediction power compared to conventional genome-based prediction models. Nevertheless, there is clear evidence that DL algorithms capture nonlinear patterns more efficiently than conventional genome-based models. Deep learning algorithms are able to integrate data from different sources, as is usually needed in GS-assisted breeding, and they show the ability to improve prediction accuracy for large plant breeding data. It is important to apply DL to large training-testing data sets.
- Research Article
506
- 10.1109/access.2019.2916887
- Jan 1, 2019
- IEEE Access
Multimodal representation learning, which aims to narrow the heterogeneity gap among different modalities, plays an indispensable role in the utilization of ubiquitous multimodal data. Due to the powerful representation ability with multiple levels of abstraction, deep learning-based multimodal representation learning has attracted much attention in recent years. In this paper, we provide a comprehensive survey of deep multimodal representation learning, a topic that has not previously been surveyed in its own right. To facilitate the discussion on how the heterogeneity gap is narrowed, according to the underlying structures in which different modalities are integrated, we categorize deep multimodal representation learning methods into three frameworks: joint representation, coordinated representation, and encoder-decoder. Additionally, we review some typical models in this area ranging from conventional models to newly developed technologies. This paper highlights the key issues of newly developed technologies, such as the encoder-decoder model, generative adversarial networks, and the attention mechanism, from a multimodal representation learning perspective, which, to the best of our knowledge, have never been reviewed previously, even though they have become the major focuses of much contemporary research. For each framework or model, we discuss its basic structure, learning objective, application scenes, key issues, advantages, and disadvantages, such that both novice and experienced researchers can benefit from this survey. Finally, we suggest some important directions for future work.