Multimodal Deep Learning for Cancer Survival Prediction: A Review

Abstract

Background: Cancer has emerged as a "leading killer" threatening human health. Survival prediction is a crucial branch of cancer prognosis that aims to estimate patients' survival risk based on their disease conditions. Accurate and efficient survival prediction is vital in cancer patients' treatment and clinical management, preventing unnecessary suffering and conserving precious medical resources. Deep learning has been extensively applied in cancer diagnosis, prognosis, and treatment management. The decreasing cost of next-generation sequencing, the continuous development of related databases, and in-depth research on multimodal deep learning have provided opportunities for establishing more functionally rich and accurate survival prediction models. Objective: The field of cancer survival prediction still lacks a review of multimodal deep learning methods. Methods: We conducted a statistical analysis of the relevant research on multimodal deep learning for cancer survival prediction. We first filtered keywords from 6 known relevant papers. Next, we searched PubMed and Google Scholar for relevant publications from 2018 to 2022 using "Multimodal", "Deep Learning" and "Cancer Survival Prediction" as keywords. We then extended the search to related publications through backward and forward citation searches. Subsequently, we conducted a detailed analysis and review of these studies based on their datasets and methods. Results: We present a comprehensive systematic review of the multimodal deep learning research on cancer survival prediction from 2018 to 2022. Conclusion: Multimodal deep learning has demonstrated powerful data aggregation capabilities and has greatly improved the performance of cancer survival prediction. It has had a significant positive impact on the advancement of automated cancer diagnosis and precision oncology.

Similar Papers
  • Book Chapter
  • Cited by 33
  • 10.1007/978-3-031-16443-9_60
Survival Prediction of Brain Cancer with Incomplete Radiology, Pathology, Genomic, and Demographic Data
  • Jan 1, 2022
  • Can Cui + 9 more

Integrating cross-department multi-modal data (e.g., radiology, pathology, genomic, and demographic data) is ubiquitous in brain cancer diagnosis and survival prediction. To date, such an integration is typically conducted by human physicians (and panels of experts), which can be subjective and semi-quantitative. Recent advances in multi-modal deep learning, however, have opened a door to carrying out such a process in a more objective and quantitative manner. Unfortunately, prior work using four modalities for brain cancer survival prediction is limited by a “complete modalities” setting (i.e., with all modalities available). Thus, there are still open questions on how to effectively predict brain cancer survival from incomplete radiology, pathology, genomic, and demographic data (e.g., one or more modalities might not be collected for a patient). For instance, should we use both complete and incomplete data, and more importantly, how do we use such data? To answer the preceding questions, we generalize multi-modal learning on cross-department multi-modal data to a missing data setting. Our contribution is three-fold: 1) We introduce a multi-modal learning with missing data (MMD) pipeline with competitive performance and less hardware consumption; 2) We extend multi-modal learning on radiology, pathology, genomic, and demographic data into missing data scenarios; 3) A large-scale public dataset (with 962 patients) is collected to systematically evaluate glioma tumor survival prediction using four modalities. The proposed method improved the C-index of survival prediction from 0.7624 to 0.8053. Keywords: Multi-modal learning; Survival prediction; Missing modalities
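The C-index reported in the abstract above is a ranking metric: among patient pairs whose survival order is known, it measures how often the model assigns the higher risk to the patient who died earlier. The following is a minimal sketch of Harrell's concordance index, not the evaluation code of any paper listed here. The function name `concordance_index`, its argument names, and the toy cohort are illustrative assumptions, and pairs with tied survival times are simply skipped for brevity.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data (simplified sketch).

    times  -- observed follow-up time per patient
    events -- 1 if death was observed, 0 if the patient was censored
    risks  -- predicted risk score per patient (higher = worse prognosis)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the patient with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed event;
        # tied times are skipped in this simplified version.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # higher risk assigned to the earlier death
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: the second patient is censored at time 8.
times = [5, 8, 12, 3]
events = [1, 0, 1, 1]
risks = [0.7, 0.4, 0.2, 0.9]
print(concordance_index(times, events, risks))  # 1.0 (perfect ranking)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for an improvement such as 0.7624 to 0.8053.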

  • Research Article
  • 10.1109/jbhi.2025.3578859
DRLSurv: Disentangled Representation Learning for Cancer Survival Prediction by Mining Multimodal Consistency and Complementarity.
  • Jan 1, 2025
  • IEEE Journal of Biomedical and Health Informatics
  • Ying Xu + 5 more

Accurate cancer survival prediction is crucial in devising optimal treatment plans and offering individualized care to improve clinical outcomes. Recent research confirms that integrating heterogeneous cancer data, such as histopathological images and genomic data, can enhance our understanding of cancer progression and provide a multimodal perspective on patient survival chances. However, existing methods often overlook the fundamental aspects of multimodal data, i.e., consistency and complementarity, which in consequence significantly hinders advancements in cancer survival prediction. To address this issue, we present DRLSurv, a novel multimodal deep learning method that leverages disentangled representation learning for precise cancer survival prediction. Through dedicated deep encoding networks, DRLSurv decomposes each modality into modality-invariant and modality-specific representations, which are mapped to common and unique feature subspaces for simultaneously mining the distinct aspects of cancer multimodal data. Moreover, our method introduces a subspace-based proximity contrastive loss and a re-disentanglement loss, thus ensuring the successful decomposition of consistent and complementary information while maintaining multimodal fidelity during the learning of disentangled representations. Both quantitative analyses and visual assessments on different datasets validate the superiority of DRLSurv over existing survival prediction approaches, demonstrating its powerful capability to exploit enriched survival-related information from cancer multimodal data. Therefore, DRLSurv not only offers a unified and comprehensive deep learning framework for advancing multimodal survival predictions, but also provides valuable insights for cancer prognosis and survival analysis.

  • Supplementary Content
  • Cited by 28
  • 10.1093/genetics/iyae161
A review of multimodal deep learning methods for genomic-enabled prediction in plant breeding
  • Nov 5, 2024
  • Genetics
  • Osval A Montesinos-López + 9 more

Deep learning methods have been applied when working to enhance the prediction accuracy of traditional statistical methods in the field of plant breeding. Although deep learning seems to be a promising approach for genomic prediction, it has proven to have some limitations, since its conventional methods fail to leverage all available information. Multimodal deep learning methods aim to improve the predictive power of their unimodal counterparts by introducing several modalities (sources) of input information. In this review, we introduce some theoretical basic concepts of multimodal deep learning and provide a list of the most widely used neural network architectures in deep learning, as well as the available strategies to fuse data from different modalities. We mention some of the available computational resources for the practical implementation of multimodal deep learning problems. We finally performed a review of applications of multimodal deep learning to genomic selection in plant breeding and other related fields. We present a meta-picture of the practical performance of multimodal deep learning methods to highlight how these tools can help address complex problems in the field of plant breeding. We discussed some relevant considerations that researchers should keep in mind when applying multimodal deep learning methods. Multimodal deep learning holds significant potential for various fields, including genomic selection. While multimodal deep learning displays enhanced prediction capabilities over unimodal deep learning and other machine learning methods, it demands more computational resources. Multimodal deep learning effectively captures intermodal interactions, especially when integrating data from different sources. To apply multimodal deep learning in genomic selection, suitable architectures and fusion strategies must be chosen. It is relevant to keep in mind that multimodal deep learning, like unimodal deep learning, is a powerful tool but should be carefully applied. Given its predictive edge over traditional methods, multimodal deep learning is valuable in addressing challenges in plant breeding and food security amid a growing global population.

  • Research Article
  • Cited by 3
  • 10.1016/j.csbj.2025.10.038
Multimodal fusion strategies for survival prediction in breast cancer: A comparative deep learning study
  • Jan 1, 2025
  • Computational and Structural Biotechnology Journal
  • Aurora Sucre + 7 more

Accurate survival prediction in breast cancer remains a key challenge in oncology, requiring models that can integrate diverse clinical, molecular, and imaging data sources to guide breast cancer management. While recent deep learning models have explored multimodal integration for cancer survival prediction, their generalizability to unseen data remains limited. In this study, we developed and optimized unimodal and multimodal models for breast cancer survival prediction, systematically assessing our optimized early and late integration strategies and their impact on out-of-sample generalization performance. We integrated clinical variables, somatic mutations, RNA expression, copy number variation, miRNA expression, and histopathology images from The Cancer Genome Atlas breast cancer dataset. Across all modality combinations, late fusion models consistently outperformed early fusion approaches and late and intermediate benchmark methods, with the combination of omics and clinical data yielding the highest test-set concordance indices. Explainability analyses showed that our models captured biologically relevant features associated with patient survival. These findings highlight the value of late-fusion multimodal deep learning frameworks for robust and explainable survival prediction in breast cancer.
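The distinction between early and late fusion drawn in the abstract above can be illustrated with a toy example. The sketch below is not the authors' models: it assumes a simple linear risk score in place of a deep network, and the function names, feature values, and weights are all hypothetical. It only shows where the modalities are combined in each strategy.

```python
def linear_risk(features, weights):
    """Toy stand-in for a trained survival model: a weighted sum of features."""
    return sum(f * w for f, w in zip(features, weights))


def early_fusion_risk(modalities, weights):
    """Early fusion: concatenate all modality features, then apply ONE model."""
    fused = [x for modality in modalities for x in modality]
    if len(fused) != len(weights):
        raise ValueError("weights must match the concatenated feature length")
    return linear_risk(fused, weights)


def late_fusion_risk(modalities, weights_per_modality):
    """Late fusion: apply one model PER modality, then average the risk scores."""
    scores = [linear_risk(m, w) for m, w in zip(modalities, weights_per_modality)]
    return sum(scores) / len(scores)


# Hypothetical patient with two modalities (values are illustrative only).
clinical = [1.0, 0.0]
omics = [0.5, 2.0, 1.0]

early = early_fusion_risk([clinical, omics], [0.2, 0.1, 0.4, 0.3, 0.0])
late = late_fusion_risk([clinical, omics], [[0.5, 0.5], [1.0, 0.0, 0.0]])
print(early, late)
```

In late fusion each modality-specific model can be trained and tuned on its own, and a modality can be dropped without retraining the others. The sketch only illustrates this structural difference and makes no claim about which strategy generalizes better.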

  • Research Article
  • Cited by 400
  • 10.1007/s00371-021-02166-7
A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets
  • Jun 10, 2021
  • The Visual Computer
  • Khaled Bayoudh + 3 more

The research progress in multimodal learning has grown rapidly over the last decade in several areas, especially in computer vision. The growing potential of multimodal data streams and deep learning algorithms has contributed to the increasing universality of deep multimodal learning. This involves the development of models capable of processing and analyzing the multimodal information uniformly. Unstructured real-world data can inherently take many forms, also known as modalities, often including visual and textual content. Extracting relevant patterns from this kind of data is still a motivating goal for researchers in deep learning. In this paper, we seek to improve the understanding of key concepts and algorithms of deep multimodal learning for the computer vision community by exploring how to generate deep models that consider the integration and combination of heterogeneous visual cues across sensory modalities. In particular, we summarize six perspectives from the current literature on deep multimodal learning, namely: multimodal data representation, multimodal fusion (i.e., both traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and zero-shot learning. We also survey current multimodal applications and present a collection of benchmark datasets for solving problems in various vision domains. Finally, we highlight the limitations and challenges of deep multimodal learning and provide insights and directions for future research.

  • Research Article
  • Cited by 4
  • 10.1016/j.jdent.2023.104588
Multi-modal deep learning for automated assembly of periapical radiographs
  • Jun 21, 2023
  • Journal of Dentistry
  • L Pfänder + 5 more


  • Dissertation
  • 10.32657/10356/182346
Data efficient deep multimodal learning
  • Jan 1, 2025
  • Meng Shen

Multimodal learning, which enables neural networks to process and integrate information from various sensory modalities such as vision, language, and sound, has become increasingly important in applications ranging from affective computing and healthcare to advanced multimodal chatbots. Despite its potential, multimodal learning faces significant challenges, particularly in the area of data efficiency. The requirement for large, high-quality datasets from multiple modalities presents a substantial barrier, limiting the scalability and accessibility of large multimodal models. This dissertation addresses several key issues in data-efficient deep multimodal learning, focusing on imbalanced multimodal data selection, the cold-start problem in multimodal active learning, and the mitigation of hallucinations in large vision-language models. Firstly, we analyze the limitations of conventional active learning strategies, which tend to favor dominant modalities, leading to unbalanced multimodal models that neglect weaker modalities. To overcome this, we propose a gradient embedding modulation method that ensures a more equitable data selection process across modalities, resulting in models that fairly utilize both weak and strong modalities. Building on our work in warm-start active learning, we tackle the cold-start problem in multimodal active learning, where no initial labels are available for warm-start data selection. We develop a two-stage approach that first reduces the modality representation gap through multimodal self-supervised learning, utilizing unimodal prototypes to harmonize representations across modalities. In the subsequent data selection stage, we introduce a regularization term to maximize modality alignment, leading to improved model performance using the same amount of data compared to existing methods. 
Extending our focus from data selection to the usage of training data, we address the challenge of hallucinations in large vision-language models, where the models generate content that is incorrect in the context of input images. We investigate the relationship between hallucinations and the visual dependence of tokens, revealing that certain tokens contribute disproportionately to these hallucinations. Based on this insight, we propose an approach that adjusts training weights according to the visual dependence of tokens, thereby reducing the hallucination rate without requiring additional training data or inference costs. The contributions of this thesis offer significant advancements in the field of data-efficient multimodal learning. By developing novel methods for balancing multimodal data selection, addressing the cold-start problem in multimodal active learning, and mitigating hallucinations in large vision-language models, this work paves the way for more practical and scalable multimodal learning systems that require less data and computational effort while achieving superior performance.

  • Research Article
  • Cited by 160
  • 10.1145/3545572
A Review on Methods and Applications in Multimodal Deep Learning
  • Feb 17, 2023
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Summaira Jabeen + 5 more

Deep learning has been applied to a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities. Despite the extensive progress made in unimodal learning, it still cannot cover all the aspects of human learning. Multimodal learning helps to understand and analyze better when various senses are engaged in the processing of information. This article focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, physiological signals, flow, RGB, pose, depth, mesh, and point cloud. A detailed analysis of the baseline approaches and an in-depth study of recent advancements during the past five years (2017 to 2021) in multimodal deep learning applications are provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth. Last, the main issues are highlighted separately for each domain, along with their possible future research directions.

  • Research Article
  • Cited by 1
  • 10.1158/1538-7445.am2024-2313
Abstract 2313: Multi-modal deep learning to predict cancer outcomes by integrating radiology and pathology images
  • Mar 22, 2024
  • Cancer Research
  • Zhe Li + 2 more

Purpose: Cancer patients routinely undergo radiologic and pathologic evaluation for their diagnostic workup. These data modalities represent a valuable and readily available resource for developing new prognostic tools. Given their vast difference in spatial scales, effective methods to integrate the two modalities are currently lacking. Here, we aim to develop a multi-modal approach to integrate radiology and pathology images for predicting outcomes in cancer patients. Methods: We propose a multi-modal weakly-supervised deep learning framework to integrate radiology and pathology images for survival prediction. We first extract multi-scale features from whole-slide H&E-stained pathology images to characterize cellular and tissue phenotypes as well as spatial cellular organization. We then build a hierarchical co-attention transformer to effectively learn the multi-modal interactions between radiology and pathology image features. Finally, a multimodal risk score is derived by combining complementary information from the two image modalities and clinical data to predict outcomes. We evaluate our approach in lung, gastric, and brain cancers with matched radiology and pathology images and clinical data available, each with separate training and external validation cohorts. Results: The multi-modal deep learning models achieved a reasonably high accuracy for predicting survival outcomes in the external validation cohorts (C-index range: 0.72-0.75 across three cancer types). The multi-modal prognostic models significantly improved upon single-modal approaches based on radiology or pathology images or clinical data alone (C-index range: 0.53-0.71, P<0.01). The multi-modal deep learning models were significantly associated with disease-free survival and overall survival (hazard ratio range: 3.23-4.46, P<0.0001). 
In multivariable analyses, the models remained an independent prognostic factor (P<0.01) after adjusting for clinicopathological variables including cancer stage and tumor differentiation. Conclusions: The proposed multi-modal deep learning approach outperforms traditional methods for predicting survival outcomes by leveraging routinely available radiology and pathology images. With further independent validation, this may afford a promising approach to improve risk stratification and better inform treatment strategies for cancer patients. Citation Format: Zhe Li, Yuming Jiang, Ruijiang Li. Multi-modal deep learning to predict cancer outcomes by integrating radiology and pathology images [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 2313.

  • Research Article
  • Cited by 58
  • 10.1016/j.imavis.2025.105509
A systematic review of intermediate fusion in multimodal deep learning for biomedical applications
  • May 1, 2025
  • Image and Vision Computing
  • Valerio Guarrasi + 6 more

Deep learning has revolutionized biomedical research by providing sophisticated methods to handle complex, high-dimensional data. Multimodal deep learning (MDL) further enhances this capability by integrating diverse data types such as imaging, textual data, and genetic information, leading to more robust and accurate predictive models. In MDL, differently from early and late fusion methods, intermediate fusion stands out for its ability to effectively combine modality-specific features during the learning process. This systematic review comprehensively analyzes and formalizes current intermediate fusion methods in biomedical applications, highlighting their effectiveness in improving predictive performance and capturing complex inter-modal relationships. We investigate the techniques employed, the challenges faced, and potential future directions for advancing intermediate fusion methods. Additionally, we introduce a novel structured notation that standardizes intermediate fusion architectures, enhancing understanding and facilitating implementation across various domains. Our findings provide actionable insights and practical guidelines intended to support researchers, healthcare professionals, and the broader deep learning community in developing more sophisticated and insightful multimodal models. Through this review, we aim to provide a foundational framework for future research and practical applications in the dynamic field of MDL.
  • Comprehensive review of intermediate fusion in multimodal learning in biomedicine.
  • Structured notation for categorizing intermediate fusion methods.
  • Analysis of the benefits and challenges of intermediate fusion in biomedical contexts.
  • Identification of future research directions for improving current fusion techniques.
  • Versatile framework applicable to other multimodal deep learning domains.

  • Research Article
  • 10.55041/ijsrem52491
Multi-Modal Learning Approaches Combining EHR, Imaging, and Genomic Data
  • Sep 9, 2025
  • INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
  • Veerendra Nath Jasthi

Abstract: Data-driven healthcare offers unprecedented opportunities to improve diagnosis, prediction, and personalized treatment by means of multi-modal learning. The present paper discusses the combination of electronic health records (EHR), medical images, and genomic data through multi-modal deep learning. By drawing on heterogeneous data sources and combining their complementary strengths, multi-modal models are able to capture richer feature representations and more complex patterns that are not visible with unimodal processing. We propose an end-to-end protocol to align, preprocess, and fuse modalities, and demonstrate deep neural networks that learn jointly from structured EHR data, high-dimensional imaging features, and gene expression data. Experiments show that the proposed model performs better on disease classification and patient stratification than single-modality counterparts. The paper highlights the need for data alignment, imputation of missing modalities, and modality-specific representation learning in order to fully exploit multi-modal data in the clinical context. Keywords: Multi-modal Learning, Electronic Health Records (EHR), Medical Imaging, Genomic Data, Deep Learning, Data Fusion, Healthcare AI, Precision Medicine, Patient Stratification, Biomedical Informatics.

  • Research Article
  • Cited by 13
  • 10.3389/fpls.2023.1094142
Study on the detection of water status of tomato (Solanum lycopersicum L.) by multimodal deep learning
  • May 31, 2023
  • Frontiers in Plant Science
  • Zhiyu Zuo + 7 more

Water plays a very important role in the growth of tomato (Solanum lycopersicum L.), and detecting the water status of tomato is the key to precise irrigation. The objective of this study is to detect the water status of tomato by fusing RGB, NIR and depth image information through deep learning. Five irrigation levels were set to cultivate tomatoes in different water states, with irrigation amounts of 150%, 125%, 100%, 75%, and 50% of reference evapotranspiration calculated by a modified Penman-Monteith equation, respectively. The water status of tomatoes was divided into five categories: severe irrigation deficit, slight irrigation deficit, moderately irrigated, slightly over-irrigated, and severely over-irrigated. RGB images, depth images and NIR images of the upper part of the tomato plant were taken as data sets. The data sets were used to train and test the tomato water status detection models built with single-modal and multimodal deep learning networks, respectively. In the single-modal deep learning network, two CNNs, VGG-16 and ResNet-50, were trained on a single RGB image, a depth image, or a NIR image for a total of six cases. In the multimodal deep learning network, two or more of the RGB images, depth images and NIR images were trained with VGG-16 or ResNet-50, respectively, for a total of 20 combinations. Results showed that the accuracy of tomato water status detection based on single-modal deep learning ranged from 88.97% to 93.09%, while the accuracy of tomato water status detection based on multimodal deep learning ranged from 93.09% to 99.18%. Multimodal deep learning significantly outperformed single-modal deep learning. The tomato water status detection model built using a multimodal deep learning network with ResNet-50 for RGB images and VGG-16 for depth and NIR images was optimal. This study provides a novel method for non-destructive detection of water status of tomato and gives a reference for precise irrigation management.
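The counts of six single-modality cases and 20 multimodal combinations stated in the abstract above can be reproduced by enumeration. The sketch below assumes one plausible reading of the abstract, namely that each image type in a combination is independently assigned either VGG-16 or ResNet-50. The variable names are illustrative and the code is not taken from the study.

```python
from itertools import combinations, product

modalities = ["RGB", "depth", "NIR"]
backbones = ["VGG-16", "ResNet-50"]

# Single-modality cases: one image type paired with one backbone.
single = [(m, b) for m in modalities for b in backbones]

# Multimodal cases (assumed reading): choose two or three image types,
# then give each chosen image type its own backbone.
multi = [
    tuple(zip(subset, assignment))
    for size in (2, 3)
    for subset in combinations(modalities, size)
    for assignment in product(backbones, repeat=size)
]

print(len(single))  # 6
print(len(multi))   # 20
```

Under this reading there are 3 pairs with 4 backbone assignments each (12) plus 1 triple with 8 assignments, giving 20. The configuration reported as optimal, ResNet-50 for RGB with VGG-16 for depth and NIR, is one of the enumerated entries.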

  • Research Article
  • Cited by 37
  • 10.1093/bioinformatics/btad025
CAMR: cross-aligned multimodal representation learning for cancer survival prediction.
  • Jan 1, 2023
  • Bioinformatics
  • Xingqi Wu + 3 more

Accurately predicting cancer survival is crucial for helping clinicians to plan appropriate treatments, which largely improves the life quality of cancer patients and spares the related medical costs. Recent advances in survival prediction methods suggest that integrating complementary information from different modalities, e.g. histopathological images and genomic data, plays a key role in enhancing predictive performance. Despite promising results obtained by existing multimodal methods, the disparate and heterogeneous characteristics of multimodal data cause the so-called modality gap problem, which brings in dramatically diverse modality representations in feature space. Consequently, detrimental modality gaps make it difficult for comprehensive integration of multimodal information via representation learning and therefore pose a great challenge to further improvements of cancer survival prediction. To solve the above problems, we propose a novel method called cross-aligned multimodal representation learning (CAMR), which generates both modality-invariant and -specific representations for more accurate cancer survival prediction. Specifically, a cross-modality representation alignment learning network is introduced to reduce modality gaps by effectively learning modality-invariant representations in a common subspace, which is achieved by aligning the distributions of different modality representations through adversarial training. Besides, we adopt a cross-modality fusion module to fuse modality-invariant representations into a unified cross-modality representation for each patient. Meanwhile, CAMR learns modality-specific representations which complement modality-invariant representations and therefore provides a holistic view of the multimodal data for cancer survival prediction. 
Comprehensive experiment results demonstrate that CAMR can successfully narrow modality gaps and consistently yields better performance than other survival prediction methods using multimodal data. CAMR is freely available at https://github.com/wxq-ustc/CAMR. Supplementary data are available at Bioinformatics online.

  • Research Article
  • Cited by 3
  • 10.1016/j.acra.2024.12.018
Multimodal Deep Learning Fusing Clinical and Radiomics Scores for Prediction of Early-Stage Lung Adenocarcinoma Lymph Node Metastasis.
  • May 1, 2025
  • Academic Radiology
  • Chengcheng Xia + 8 more


  • Research Article
  • Cited by 37
  • 10.1007/s13755-021-00151-x
Computer-aided diagnosis of hepatocellular carcinoma fusing imaging and structured health data.
  • May 4, 2021
  • Health Information Science and Systems
  • Alan Baronio Menegotto + 2 more

Hepatocellular carcinoma is the prevalent primary liver cancer, a silent disease that killed 782,000 worldwide in 2018. Multimodal deep learning is the application of deep learning techniques, fusing more than one data modality as the model's input. A computer-aided diagnosis system for hepatocellular carcinoma developed with multimodal deep learning approaches could use multiple data modalities as recommended by clinical guidelines, and enhance the robustness and the value of the second-opinion given to physicians. This article describes the process of creation and evaluation of an algorithm for computer-aided diagnosis of hepatocellular carcinoma developed with multimodal deep learning techniques fusing preprocessed computed-tomography images with structured data from patient Electronic Health Records. The classification performance achieved by the proposed algorithm in the test dataset was: accuracy = 86.9%, precision = 89.6%, recall = 86.9% and F-Score = 86.7%. These classification performance metrics are closer to the state-of-the-art in this area and were achieved with data modalities which are cheaper than traditional Magnetic Resonance Imaging approaches, enabling the use of the proposed algorithm by low and mid-sized healthcare institutions. The classification performance achieved with the multimodal deep learning algorithm is higher than human specialists diagnostic performance using only CT for diagnosis. Even though the results are promising, the multimodal deep learning architecture used for hepatocellular carcinoma prediction needs more training and test processes using different datasets before the use of the proposed algorithm by physicians in real healthcare routines. The additional training aims to confirm the classification performance achieved and enhance the model's robustness.
