Computer-aided diagnosis of hepatocellular carcinoma fusing imaging and structured health data.
Hepatocellular carcinoma is the most prevalent primary liver cancer, a silent disease that killed 782,000 worldwide in 2018. Multimodal deep learning is the application of deep learning techniques, fusing more than one data modality as the model's input. A computer-aided diagnosis system for hepatocellular carcinoma developed with multimodal deep learning approaches could use multiple data modalities as recommended by clinical guidelines, and enhance the robustness and the value of the second opinion given to physicians. This article describes the process of creation and evaluation of an algorithm for computer-aided diagnosis of hepatocellular carcinoma developed with multimodal deep learning techniques fusing preprocessed computed-tomography images with structured data from patient Electronic Health Records. The classification performance achieved by the proposed algorithm in the test dataset was: accuracy = 86.9%, precision = 89.6%, recall = 86.9% and F-Score = 86.7%. These classification performance metrics are close to the state-of-the-art in this area and were achieved with data modalities which are cheaper than traditional Magnetic Resonance Imaging approaches, enabling the use of the proposed algorithm by low and mid-sized healthcare institutions. The classification performance achieved with the multimodal deep learning algorithm is higher than human specialists' diagnostic performance using only CT for diagnosis. Even though the results are promising, the multimodal deep learning architecture used for hepatocellular carcinoma prediction needs more training and test processes using different datasets before the use of the proposed algorithm by physicians in real healthcare routines. The additional training aims to confirm the classification performance achieved and enhance the model's robustness.
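The four metrics reported above can all be derived from a confusion matrix. The sketch below is illustrative only: the abstract does not say which averaging scheme was used, so support-weighted averaging is an assumption here (it is the scheme under which recall always equals accuracy, as in the reported 86.9% figures), and the function name `weighted_metrics` and the toy matrix are hypothetical.

```python
from typing import Dict, List


def weighted_metrics(confusion: List[List[int]]) -> Dict[str, float]:
    """Accuracy plus support-weighted precision, recall, and F-score.

    confusion[i][j] counts samples whose true class is i and predicted class is j.
    Support-weighted averaging is an assumption, not something the abstract states.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    precision = recall = f_score = 0.0
    for c in range(n_classes):
        support = sum(confusion[c])  # number of true samples of class c
        predicted = sum(confusion[r][c] for r in range(n_classes))
        tp = confusion[c][c]
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total  # each class contributes in proportion to its support
        precision += weight * p
        recall += weight * r
        f_score += weight * f

    return {
        "accuracy": correct / total,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical 2-class confusion matrix (rows: true class, columns: predicted class).
toy = [[40, 10], [5, 45]]
metrics = weighted_metrics(toy)
```

With this weighting, the recall term for each class reduces to its true positives divided by the total sample count, so the sum is exactly the accuracy.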
- Research Article
17
- 10.3389/frai.2023.1247195
- Oct 27, 2023
- Frontiers in artificial intelligence
Hepatocellular carcinoma is a malignant neoplasm of the liver and a leading cause of cancer-related deaths worldwide. The multimodal data combines several modalities, such as medical images, clinical parameters, and electronic health record (EHR) reports, from diverse sources to accomplish the diagnosis of liver cancer. The introduction of deep learning models with multimodal data can enhance the diagnosis and improve physicians' decision-making for cancer patients. This scoping review explores the use of multimodal deep learning techniques (i.e., combining medical images and EHR data) in the diagnosis and prognosis of hepatocellular carcinoma (HCC) and cholangiocarcinoma (CCA). A comprehensive literature search was conducted in six databases along with forward and backward reference list checking of the included studies. PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) extension for scoping review guidelines were followed for the study selection process. The data was extracted and synthesized from the included studies through thematic analysis. Ten studies were included in this review. These studies utilized multimodal deep learning to predict and diagnose hepatocellular carcinoma (HCC), but no studies examined cholangiocarcinoma (CCA). Four imaging modalities (CT, MRI, WSI, and DSA) and 51 unique EHR records (clinical parameters and biomarkers) were used in these studies. The most frequently used medical imaging modalities were CT scans followed by MRI, whereas the most common EHR parameters used were age, gender, alpha-fetoprotein (AFP), albumin, coagulation factors, and bilirubin. Ten unique deep-learning techniques were applied to both EHR modalities and imaging modalities for two main purposes, prediction and diagnosis. The use of multimodal data and deep learning techniques can help in the diagnosis and prediction of HCC.
However, there is a limited number of works and available datasets for liver cancer, thus limiting the overall advancements of AI for liver cancer applications. Hence, more research should be undertaken to explore further the potential of multimodal deep learning in liver cancer applications.
- Supplementary Content
28
- 10.1093/genetics/iyae161
- Nov 5, 2024
- Genetics
Deep learning methods have been applied when working to enhance the prediction accuracy of traditional statistical methods in the field of plant breeding. Although deep learning seems to be a promising approach for genomic prediction, it has proven to have some limitations, since its conventional methods fail to leverage all available information. Multimodal deep learning methods aim to improve the predictive power of their unimodal counterparts by introducing several modalities (sources) of input information. In this review, we introduce some theoretical basic concepts of multimodal deep learning and provide a list of the most widely used neural network architectures in deep learning, as well as the available strategies to fuse data from different modalities. We mention some of the available computational resources for the practical implementation of multimodal deep learning problems. We finally performed a review of applications of multimodal deep learning to genomic selection in plant breeding and other related fields. We present a meta-picture of the practical performance of multimodal deep learning methods to highlight how these tools can help address complex problems in the field of plant breeding. We discussed some relevant considerations that researchers should keep in mind when applying multimodal deep learning methods. Multimodal deep learning holds significant potential for various fields, including genomic selection. While multimodal deep learning displays enhanced prediction capabilities over unimodal deep learning and other machine learning methods, it demands more computational resources. Multimodal deep learning effectively captures intermodal interactions, especially when integrating data from different sources. To apply multimodal deep learning in genomic selection, suitable architectures and fusion strategies must be chosen. It is relevant to keep in mind that multimodal deep learning, like unimodal deep learning, is a powerful tool but should be carefully applied.
Given its predictive edge over traditional methods, multimodal deep learning is valuable in addressing challenges in plant breeding and food security amid a growing global population.
- Research Article
4
- 10.1016/j.jdent.2023.104588
- Jun 21, 2023
- Journal of Dentistry
Multi-modal deep learning for automated assembly of periapical radiographs
- Research Article
160
- 10.1145/3545572
- Feb 17, 2023
- ACM Transactions on Multimedia Computing, Communications, and Applications
Deep learning has been applied in a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities. Despite the extensive development made for unimodal learning, it still cannot cover all the aspects of human learning. Multimodal learning helps to understand and analyze better when various senses are engaged in the processing of information. This article focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, physiological signals, flow, RGB, pose, depth, mesh, and point cloud. Detailed analysis of the baseline approaches and an in-depth study of recent advancements during the past five years (2017 to 2021) in multimodal deep learning applications has been provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth. Last, main issues are highlighted separately for each domain, along with their possible future research directions.
- Dissertation
- 10.32657/10356/182346
- Jan 1, 2025
Multimodal learning, which enables neural networks to process and integrate information from various sensory modalities such as vision, language, and sound, has become increasingly important in applications ranging from affective computing and healthcare to advanced multimodal chatbots. Despite its potential, multimodal learning faces significant challenges, particularly in the area of data efficiency. The requirement for large, high-quality datasets from multiple modalities presents a substantial barrier, limiting the scalability and accessibility of large multimodal models. This dissertation addresses several key issues in data-efficient deep multimodal learning, focusing on imbalanced multimodal data selection, the cold-start problem in multimodal active learning, and the mitigation of hallucinations in large vision-language models. Firstly, we analyze the limitations of conventional active learning strategies, which tend to favor dominant modalities, leading to unbalanced multimodal models that neglect weaker modalities. To overcome this, we propose a gradient embedding modulation method that ensures a more equitable data selection process across modalities, resulting in models that fairly utilize both weak and strong modalities. Building on our work in warm-start active learning, we tackle the cold-start problem in multimodal active learning, where no initial labels are available for warm-start data selection. We develop a two-stage approach that first reduces the modality representation gap through multimodal self-supervised learning, utilizing unimodal prototypes to harmonize representations across modalities. In the subsequent data selection stage, we introduce a regularization term to maximize modality alignment, leading to improved model performance using the same amount of data compared to existing methods.
Extending our focus from data selection to the usage of training data, we address the challenge of hallucinations in large vision-language models, where the models generate content that is incorrect in the context of input images. We investigate the relationship between hallucinations and visual dependence of tokens, revealing that certain tokens contribute disproportionately to these hallucinations. Based on this insight, we propose an approach that adjusts training weights according to the visual dependence of tokens, thereby reducing the hallucination rate without requiring additional training data or inference costs. The contributions of this thesis offer significant advancements in the field of data-efficient multimodal learning. By developing novel methods for balancing multimodal data selection, addressing the cold-start problem in multimodal active learning, and mitigating hallucinations in large vision-language models, this work paves the way for more practical and scalable multimodal learning systems that require less data and computational effort while achieving superior performance.
- Research Article
2
- 10.36001/phmap.2023.v4i1.3783
- Sep 4, 2023
- PHM Society Asia-Pacific Conference
Prognostics and Health Management (PHM) is identified as an important lever for enhancing the development of predictive maintenance to ensure the reliability, availability, and safety of industrial systems. However, the efficiency of data-driven PHM approaches is dependent on the quality and quantity of data. Therefore, exploiting multiple data sources can provide more useful information than single-modal data. For instance, by incorporating multiple data sources, including condition monitoring data, images from cameras, and texts from maintenance technicians’ reports, multi-modal learning can provide a more comprehensive and accurate understanding of the system’s health. However, multi-modal deep learning is complex to understand. To address this complexity, it is crucial to incorporate explainable artificial intelligence techniques to provide clear and interpretable insights into how the model makes decisions. In this light, this paper proposes the application of the model-agnostic-explanation approach, i.e., SHAP, to explain the working mechanism of multimodal learning for the prediction of industrial steam generator degradation. Particularly, we determine the important features of each data modality and investigate how multimodal learning can overcome the issues of low-quality data from a single modality due to the additional information from other data modalities.
- Research Article
39
- 10.1016/j.artmed.2023.102719
- Nov 15, 2023
- Artificial Intelligence in Medicine
Motivation: Acute ischemic stroke is one of the leading causes of morbidity and disability worldwide, often followed by a long rehabilitation period. To improve and personalize stroke rehabilitation, it is essential to provide a reliable prognosis to caregivers and patients. Deep learning techniques might improve the predictions by incorporating different data modalities. We present a multimodal approach to predict the functional status of acute ischemic stroke patients after their discharge based on tabular data and CT perfusion imaging. Methods: We conducted experiments on tabular, imaging, and multimodal deep learning architectures to predict dichotomized mRS scores 3 months after the event. The dataset was collected from a Dutch hospital and includes 98 CVA patients with a visible occlusion on their CT perfusion scan. Tabular data is based on the Dutch Acute Stroke Audit data, and imaging data consists of summed-up CT perfusion maps. Results: On the tabular data, TabNet outperformed our baselines with an AUC of 0.71, while ResNet-10 on the imaging data performed comparably with an AUC of 0.70. Our implementation of the multimodal DAFT architecture outperforms baselines as well as comparable studies by achieving an AUC of 0.75 and an F1 score of 0.80. This was achieved with a final model of less than a hundred thousand optimizable parameters, and a dataset less than half the size of reference papers. Conclusion: Overall, we demonstrate the feasibility of predicting the functional outcome for ischemic stroke patients and the usability of multimodal deep learning architectures for this purpose.
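The AUC values quoted above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition follows; the function name `roc_auc`, the labels, and the scores are hypothetical and are not taken from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical dichotomized outcomes (1 = unfavourable) and model scores.
toy_labels = [1, 0, 1, 0, 1]
toy_scores = [0.9, 0.4, 0.6, 0.7, 0.8]
auc = roc_auc(toy_labels, toy_scores)
```

The double loop is quadratic in the number of samples, which is acceptable for a cohort of this size; a rank-based formulation would be the usual choice for larger datasets.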
- Research Article
3
- 10.2174/0115748936289033240424071522
- May 1, 2025
- Current Bioinformatics
Background: Cancer has emerged as the "leading killer" of human health. Survival prediction is a crucial branch of cancer prognosis. It aims to estimate patients' survival risk based on their disease conditions. Accurate and efficient survival prediction is vital in cancer patients' treatment and clinical management, preventing unnecessary suffering and conserving precious medical resources. Deep learning has been extensively applied in cancer diagnosis, prognosis, and treatment management. The decreasing cost of next-generation sequencing, continuous development of related databases, and in-depth research on multimodal deep learning have provided opportunities for establishing more functionally rich and accurate survival prediction models. Objective: The current area of cancer survival prediction still lacks a review of multimodal deep learning methods. Methods: We conducted a statistical analysis of the relevant research on multimodal deep learning for cancer survival prediction. We first filtered keywords from 6 known relevant papers. Then, we searched PubMed and Google Scholar for relevant publications from 2018 to 2022 using "Multimodal", "Deep Learning" and "Cancer Survival Prediction" as keywords. Then, we further searched the related publications through the backward and forward citation search. Subsequently, we conducted a detailed analysis and review of these studies based on their datasets and methods. Results: We present a comprehensive systematic review of the multimodal deep learning research on cancer survival prediction from 2018 to 2022. Conclusion: Multimodal deep learning has demonstrated powerful data aggregation capabilities and excellent performance in improving cancer survival prediction greatly. It has made a significant positive impact on facilitating the advancement of automated cancer diagnosis and precision oncology.
- Research Article
- 10.64235/j62xmk30
- Jan 10, 2025
- Journal of Science Technology and Social Transformation
Early detection of sepsis in intensive care units (ICUs) remains a critical challenge due to the rapid progression of the condition and the complexity of physiological signals associated with its onset. Advances in artificial intelligence, particularly deep learning, have enabled the development of predictive models capable of identifying early warning signs of sepsis from large-scale clinical datasets. However, many of these models operate as black-box systems, limiting their interpretability and reducing clinical trust. This study presents an explainable artificial intelligence (XAI)-driven multimodal deep learning framework designed to improve early sepsis prediction in ICU environments. The proposed approach integrates multiple healthcare data modalities, including vital signs, laboratory measurements, and electronic health records, to capture complex interactions among clinical variables. In addition to achieving high predictive performance, the framework incorporates explainability techniques that highlight the most influential clinical features contributing to the model’s predictions. The results demonstrate that the multimodal model improves prediction accuracy and enables earlier detection of sepsis compared to traditional machine learning approaches, while also providing transparent insights to support clinical decision-making. The findings highlight the potential of combining multimodal deep learning and explainable AI to enhance patient monitoring systems and assist healthcare professionals in making timely and informed interventions in critical care settings.
- Research Article
58
- 10.1016/j.imavis.2025.105509
- May 1, 2025
- Image and Vision Computing
Deep learning has revolutionized biomedical research by providing sophisticated methods to handle complex, high-dimensional data. Multimodal deep learning (MDL) further enhances this capability by integrating diverse data types such as imaging, textual data, and genetic information, leading to more robust and accurate predictive models. In MDL, differently from early and late fusion methods, intermediate fusion stands out for its ability to effectively combine modality-specific features during the learning process. This systematic review comprehensively analyzes and formalizes current intermediate fusion methods in biomedical applications, highlighting their effectiveness in improving predictive performance and capturing complex inter-modal relationships. We investigate the techniques employed, the challenges faced, and potential future directions for advancing intermediate fusion methods. Additionally, we introduce a novel structured notation that standardizes intermediate fusion architectures, enhancing understanding and facilitating implementation across various domains. Our findings provide actionable insights and practical guidelines intended to support researchers, healthcare professionals, and the broader deep learning community in developing more sophisticated and insightful multimodal models. Through this review, we aim to provide a foundational framework for future research and practical applications in the dynamic field of MDL.
• Comprehensive review of intermediate fusion in multimodal learning in biomedicine.
• Structured notation for categorizing intermediate fusion methods.
• Analysis of the benefits and challenges of intermediate fusion in biomedical contexts.
• Identification of future research directions for improving current fusion techniques.
• Versatile framework applicable to other multimodal deep learning domains.
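To make the distinction concrete: early fusion concatenates raw inputs before any modelling, late fusion combines the outputs of separately trained models, and intermediate fusion merges modality-specific hidden features inside one network. The sketch below illustrates only the intermediate case with a toy forward pass. It is not the notation or any architecture from the review; the layer sizes, weights, and names (`intermediate_fusion`, `params`) are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def linear(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def intermediate_fusion(image_feat: Vector, tabular_feat: Vector, params: Dict) -> float:
    """Encode each modality separately, concatenate the hidden features, then predict."""
    h_img = relu(linear(image_feat, params["img_w"], params["img_b"]))    # image branch
    h_tab = relu(linear(tabular_feat, params["tab_w"], params["tab_b"]))  # tabular branch
    fused = h_img + h_tab  # list concatenation is the fusion point
    return linear(fused, params["head_w"], params["head_b"])[0]          # joint head


# Hypothetical fixed weights; a real model would learn them end to end.
params = {
    "img_w": [[1.0, 0.0], [0.0, -1.0]], "img_b": [0.0, 0.0],
    "tab_w": [[0.5]], "tab_b": [0.0],
    "head_w": [[1.0, 1.0, 2.0]], "head_b": [0.5],
}
score = intermediate_fusion([1.0, 2.0], [3.0], params)
```

Because the joint head sees both sets of hidden features, gradients from the prediction loss would flow back into both branches during training, which is what lets the branches adapt to each other.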
- Research Article
139
- 10.2991/ijcis.d.200120.001
- Jan 1, 2020
- International Journal of Computational Intelligence Systems
Traffic flow forecasting has been regarded as a key problem of intelligent transport systems. In this work, we propose a hybrid multimodal deep learning method for short-term traffic flow forecasting, which can jointly and adaptively learn the spatial–temporal correlation features and long temporal interdependence of multi-modality traffic data by an attention auxiliary multimodal deep learning architecture. According to the highly nonlinear characteristics of multi-modality traffic data, the base module of our method consists of one-dimensional convolutional neural networks (1D CNN) and gated recurrent units (GRU) with the attention mechanism. The former is to capture the local trend features and the latter is to capture the long temporal dependencies. Then, we design a hybrid multimodal deep learning framework for fusing shared representation features of different modality traffic data by multiple CNN-GRU-Attention modules. The experimental results indicate that the proposed multimodal deep learning model is capable of dealing with complex nonlinear urban traffic flow forecasting with satisfying accuracy and effectiveness.
- Research Article
1
- 10.1158/1538-7445.am2024-2313
- Mar 22, 2024
- Cancer Research
Purpose: Cancer patients routinely undergo radiologic and pathologic evaluation for their diagnostic workup. These data modalities represent a valuable and readily available resource for developing new prognostic tools. Given their vast difference in spatial scales, effective methods to integrate the two modalities are currently lacking. Here, we aim to develop a multi-modal approach to integrate radiology and pathology images for predicting outcomes in cancer patients. Methods: We propose a multi-modal weakly-supervised deep learning framework to integrate radiology and pathology images for survival prediction. We first extract multi-scale features from whole-slide H&E-stained pathology images to characterize cellular and tissue phenotypes as well as spatial cellular organization. We then build a hierarchical co-attention transformer to effectively learn the multi-modal interactions between radiology and pathology image features. Finally, a multimodal risk score is derived by combining complementary information from two image modalities and clinical data for predicting outcome. We evaluate our approach in lung, gastric, and brain cancers with matched radiology and pathology images and clinical data available, each with separate training and external validation cohorts. Results: The multi-modal deep learning models achieved a reasonably high accuracy for predicting survival outcomes in the external validation cohorts (C-index range: 0.72-0.75 across three cancer types). The multi-modal prognostic models significantly improved upon single-modal approaches based on radiology or pathology images or clinical data alone (C-index range: 0.53-0.71, P<0.01). The multi-modal deep learning models were significantly associated with disease-free survival and overall survival (hazard ratio range: 3.23-4.46, P<0.0001).
In multivariable analyses, the models remained an independent prognostic factor (P<0.01) after adjusting for clinicopathological variables including cancer stage and tumor differentiation. Conclusions: The proposed multi-modal deep learning approach outperforms traditional methods for predicting survival outcomes by leveraging routinely available radiology and pathology images. With further independent validation, this may afford a promising approach to improve risk stratification and better inform treatment strategies for cancer patients. Citation Format: Zhe Li, Yuming Jiang, Ruijiang Li. Multi-modal deep learning to predict cancer outcomes by integrating radiology and pathology images [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 2313.
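The C-index reported in this abstract measures how often a model ranks patients' risks in the same order as their observed survival times. A minimal sketch of Harrell's concordance index follows, assuming right-censored data and a convention in which a higher risk score means shorter expected survival. The function name and toy data are hypothetical, and the abstract does not state which C-index variant the authors used.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float], events: Sequence[int], risks: Sequence[float]
) -> float:
    """Harrell's C: share of comparable pairs where the earlier failure has the higher risk.

    A pair (i, j) is comparable when subject i had an observed event (events[i] == 1)
    and a strictly shorter follow-up time than subject j. Tied risks count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical cohort: follow-up time, event indicator (0 = censored), predicted risk.
toy_times = [2.0, 4.0, 6.0, 8.0]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.3, 0.5, 0.1]
c_index = concordance_index(toy_times, toy_events, toy_risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the 0.53-0.71 and 0.72-0.75 ranges above.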
- Research Article
- 10.1177/11769351261420789
- Feb 1, 2026
- Cancer informatics
This research aims to develop and evaluate a clinically deployable multimodal deep learning framework for breast cancer diagnosis that maintains robustness, even when clinical data are asynchronous, unpaired, or incomplete, effectively addressing real-world challenges related to data heterogeneity and fragmented clinical workflows. In this retrospective study, a multimodal deep learning architecture was developed that integrates histopathological images with structured clinical risk factors. Custom models were developed and independently trained for each modality, and late fusion was achieved via a dynamically reweighted Sinkhorn-based fusion layer. Model performance was evaluated using precision-recall Area Under Curve (PR-AUC), recall, F1 score, and Brier score under complete and partial modality availability scenarios. Robustness and clinical utility were further assessed through statistical significance testing and decision curve analysis (DCA). Additionally, we employed a Sinkhorn cost matrix to enhance interpretability. The proposed Sinkhorn fusion model outperformed all baseline methods, achieving the highest recall (0.96), PR-AUC (0.775), F1 score (0.828), and the best calibration (Brier score ≈ 0.19). Notably, it maintained perfect recall (1.00) under a 50% simulated modality dropout, despite a significant drop in PR-AUC (20% vs 0%: t = -20.35, P < .0001; 50% vs 0%: t = 88.60, P < .0001), portraying a strong overall robustness to information missingness. Under internally controlled conditions, DCA demonstrated superior clinical utility across thresholds of 0.2 to 0.7. The model's ability to accommodate unpaired and incomplete clinical inputs while maintaining both calibration and sensitivity makes it particularly well-suited for deployment in asynchronous and resource-constrained settings. 
Its consistent performance under clinical uncertainty and minimal preprocessing requirements represents a significant advancement toward equitable, reliable, and scalable AI-assisted breast cancer screening. To our knowledge, this is the first paper to model breast cancer late fusion as an optimal transport problem.
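For readers unfamiliar with the calibration metric cited above, the Brier score is the mean squared difference between predicted probabilities and binary outcomes, so lower is better. The sketch below shows only that definition. It does not reproduce the Sinkhorn-based fusion layer, and the function name and sample values are hypothetical.

```python
from typing import Sequence


def brier_score(labels: Sequence[int], probabilities: Sequence[float]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(labels) != len(probabilities) or not labels:
        raise ValueError("labels and probabilities must be non-empty and equally long")
    return sum((p - y) ** 2 for y, p in zip(labels, probabilities)) / len(labels)


# Hypothetical predictions for four cases.
toy_score = brier_score([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.5])

# An uninformative model that always predicts 0.5 scores exactly 0.25.
baseline = brier_score([1, 0], [0.5, 0.5])
```

The 0.25 baseline is a useful reference point when reading a reported score of roughly 0.19.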
- Book Chapter
- 10.1201/9781032625829-5
- Nov 7, 2024
Mental workload is an important predictor of human cognitive and physical performance, and it must be measured precisely and efficiently for applications ranging from individualized health care to productivity enhancement. Present methods rely mainly, and separately, on physiological measurements or eye-tracking data, limiting the precision with which mental workload can be assessed across levels of memory, response time, and precision. The present findings point to the need for an integrated, multimodal strategy to surmount these limitations, and significantly improve mental workload estimation accuracy. As a result, here we propose a novel multimodal deep learning architecture that effectively incorporates eye-tracking and physiological data. Apart from complex information related to fixation time, saccade velocity, and averaged pupil diameter from eye movement data, our methodology captures a range of physiological signals, such as ECG readings, glucose fluctuations, and blood pressure changes. An accurate assessment of mental demands is then made by fusing data from multiple sources using ensemble learning and an efficient 1D Convolutional Neural Network (1D CNN) classifier. The proposed model outperformed previous techniques with 2.9%, 3.5%, and 3.4% increases in precision, accuracy, and recall, respectively. The fact that the methodology also demonstrated a 2.5% drop in latency levels further reinforced the promise of a faster implementation of the approach for a more responsive, real-time mental workload estimation technique. The current study establishes the groundbreaking potential of our multimodal approach in providing a thorough and accurate assessment of mental burdens, thereby opening up important applications in a wide variety of domains.
- Research Article
176
- 10.1038/s41746-022-00613-w
- Jun 8, 2022
- NPJ Digital Medicine
Prostate cancer is the most frequent cancer in men and a leading cause of cancer death. Determining a patient’s optimal therapy is a challenge, where oncologists must select a therapy with the highest likelihood of success and the lowest likelihood of toxicity. International standards for prognostication rely on non-specific and semi-quantitative tools, commonly leading to over- and under-treatment. Tissue-based molecular biomarkers have attempted to address this, but most have limited validation in prospective randomized trials and expensive processing costs, posing substantial barriers to widespread adoption. There remains a significant need for accurate and scalable tools to support therapy personalization. Here we demonstrate prostate cancer therapy personalization by predicting long-term, clinically relevant outcomes using a multimodal deep learning architecture and train models using clinical data and digital histopathology from prostate biopsies. We train and validate models using five phase III randomized trials conducted across hundreds of clinical centers. Histopathological data was available for 5654 of 7764 randomized patients (71%) with a median follow-up of 11.4 years. Compared to the most common risk-stratification tool—risk groups developed by the National Cancer Center Network (NCCN)—our models have superior discriminatory performance across all endpoints, ranging from 9.2% to 14.6% relative improvement in a held-out validation set. This artificial intelligence-based tool improves prognostication over standard tools and allows oncologists to computationally predict the likeliest outcomes of specific patients to determine optimal treatment. Outfitted with digital scanners and internet access, any clinic could offer such capabilities, enabling global access to therapy personalization.