Articles published on Multimodal Deep Learning
1418 Search results
- New
- Research Article
- 10.1016/j.autcon.2026.106851
- May 1, 2026
- Automation in Construction
- Jaehui Son + 3 more
Scenario-based multimodal deep learning framework for simultaneous detection of construction accident causal factors and risk evaluation
- New
- Research Article
- 10.1016/j.ejca.2026.116679
- May 1, 2026
- European journal of cancer (Oxford, England : 1990)
- Xiangxue Wang + 14 more
MuTriM: A multiscale deep learning model integrating longitudinal radiomics and pathomic features for predicting recurrence and adjuvant radiation benefit in breast cancer.
- New
- Research Article
- 10.1016/j.cmpb.2026.109306
- May 1, 2026
- Computer methods and programs in biomedicine
- Xuemin Liu + 4 more
Integrating multimodal data and deep learning for functional assessment and rehabilitation prediction after cerebral hemorrhage.
- New
- Research Article
- 10.1016/j.ejogrb.2026.115018
- May 1, 2026
- European journal of obstetrics, gynecology, and reproductive biology
- Peijun Li + 16 more
Preoperative identification of deep myometrial invasion in endometrial cancer: a multicenter MRI study with a vision foundation model-enhanced multimodal deep learning framework.
- New
- Research Article
- 10.1016/j.jbi.2026.105001
- May 1, 2026
- Journal of biomedical informatics
- Jennifer Martin + 9 more
Explainable multimodal deep learning models for variable-length sequences in critically ill patients.
- New
- Research Article
- 10.1016/j.compstruc.2026.108216
- May 1, 2026
- Computers & Structures
- Feiyu Zhou + 1 more
Transformer self-attention encoder–decoder with multimodal deep learning for response time series forecasting and digital twin support in wind structural health monitoring
- New
- Research Article
- 10.22214/ijraset.2026.80540
- Apr 30, 2026
- International Journal for Research in Applied Science and Engineering Technology
- Muppidi Siva Narayana
Recent advancements in Artificial Intelligence have enabled the development of multimodal systems capable of reasoning over both visual and textual data. Visual Question Answering (VQA) is a key application in this domain; however, most existing models operate as black-box systems, lacking transparency and interpretability. Additionally, these systems often suffer from language bias, leading to unreliable and non-generalizable predictions. To address these limitations, this paper proposes ECM²RS (Explainable Causal Multi-Modal Reasoning System), a novel framework that integrates multimodal deep learning with neuro-symbolic reasoning and explainability techniques. The system leverages LLaVA as the core reasoning engine and incorporates multi-level explanation modules, including visual explanations using gradient-based methods, textual explanations via attention mechanisms, and knowledge-based reasoning from external datasets. The proposed approach is evaluated using VQA, CLEVR, and ScienceQA datasets to ensure both real-world applicability and logical reasoning capability. Experimental results demonstrate that ECM²RS enhances interpretability while reducing black-box behaviour, producing coherent and explainable reasoning outputs. This work contributes toward building trustworthy and interpretable multimodal AI systems.
- New
- Research Article
- 10.47760/ijcsmc.2026.v15i04.012
- Apr 30, 2026
- International Journal of Computer Science and Mobile Computing
- S Nandhinidevi + 1 more
Lung diseases such as COVID-19 and pneumonia represent a significant global health challenge that calls for accurate and timely diagnostic systems. Although traditional machine learning and deep learning methods offer solutions based on medical images, they depend on consolidating large amounts of data in a centralized location. Centralized data collection raises problems such as privacy concerns, data breaches, and unauthorized access, because medical data are highly sensitive. This work addresses these problems by integrating a federated learning framework that provides a privacy-preserving distributed learning environment, allowing the model to be trained without sharing medical data. The framework uses existing deep learning architectures: InceptionV3, ResNet50, and DenseNet121. To simulate the distributed environment, the dataset is distributed across three clients, and every model is trained on each client. After the local models are trained independently, the weights of the global model are updated using the FedAvg algorithm. Finally, the performance of the three models is evaluated with metrics such as accuracy, precision, recall, and F1-score. Experimental results show that DenseNet121 achieves the highest performance, with a classification accuracy of 90%, owing to its dense connectivity and efficient feature reuse.
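The aggregation step mentioned in this abstract can be illustrated with a minimal sketch of FedAvg-style weight averaging. This is not the authors' implementation: the `fed_avg` function, the flat-list weight layout, the parameter name `layer`, and the client sample counts are all hypothetical and chosen only to show the sample-size-weighted mean over three clients.

```python
from typing import Dict, List

# Hypothetical weight layout: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def fed_avg(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Average client parameters, weighting each client by its sample count."""
    total = sum(client_sizes)
    averaged: Weights = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Three illustrative clients (values and sizes are made up for the example).
clients = [
    {"layer": [1.0, 2.0]},
    {"layer": [3.0, 4.0]},
    {"layer": [5.0, 6.0]},
]
sizes = [100, 100, 200]
global_weights = fed_avg(clients, sizes)
print(global_weights)  # {'layer': [3.5, 4.5]}
```

In a full training loop, the averaged weights would be sent back to the clients for the next round; that loop is omitted here.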
- New
- Research Article
- 10.1088/1361-6579/ae6415
- Apr 23, 2026
- Physiological measurement
- Hong Duc Nguyen + 1 more
Accurate physiological assessment of cardiac function from heart sounds remains challenging due to background noise, variable heart rates, and the need for reliable cardiac-cycle segmentation. This study aimed to develop a fully end-to-end (E2E) deep learning framework that extracts diagnostic information directly from raw heart sound recordings for cardiac abnormality detection and classification. We propose HS-MMNet, an E2E multi-modal deep learning framework designed for physiological heart sound analysis. Recordings are preprocessed (normalization and 25-400 Hz bandpass filtering) and divided into fixed-length 2.5-s segments. A Convolution Head with multi-atrous spatial pyramid and channel-spatial attention extracts fine-grained local temporal patterns from the filtered 1-D waveform. A Transformer Head captures long-range spectro-temporal dependencies from Log-Mel spectrograms. These hypotheses are iteratively fused by a novel Multi-Hypothesis Cross-Attention (MH-CA) module with cyclic query-key-value assignment and a hypothesis-mixing MLP, enabling rich cross-site interaction and effective suppression of noise and non-informative regions. Recording-level classification is obtained via a fully connected layer. On the PhysioNet/CinC Challenge 2016 dataset, HS-MMNet achieved 94.80% accuracy, 92.10% sensitivity, 96.85% specificity, 87.50% precision, and 89.74% F1-score, outperforming all previously reported methods. On the balanced five-class Yaseen dataset (normal, aortic stenosis, mitral regurgitation, mitral stenosis, mitral valve prolapse), it attained 99.60% macro-averaged precision, recall, and F1-score with only four misclassifications in 1000 recordings, establishing new state-of-the-art (SOTA) benchmarks. HS-MMNet represents an advance in automated physiological measurement from heart sounds. By eliminating cardiac cycle detection and multi-channel requirements while achieving SOTA diagnostic performance, it provides a practical, scalable solution for accurate cardiovascular screening in primary-care and low-resource settings.
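The normalization and fixed-length segmentation steps described in this abstract might look roughly like the sketch below. It is an illustration only: the sampling rate, the peak-amplitude normalization, and the choice to drop a trailing partial window are assumptions not stated in the abstract, and the 25-400 Hz bandpass filter is deliberately left out.

```python
from typing import List


def normalize(signal: List[float]) -> List[float]:
    """Scale to [-1, 1] by peak absolute amplitude (one possible normalization)."""
    peak = max((abs(x) for x in signal), default=0.0)
    return [x / peak for x in signal] if peak > 0 else list(signal)


def segment(signal: List[float], sample_rate: int, seconds: float = 2.5) -> List[List[float]]:
    """Split into non-overlapping fixed-length windows.

    Assumption: a trailing window shorter than `seconds` is dropped.
    """
    size = int(sample_rate * seconds)
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


fs = 2000  # assumed sampling rate in Hz, for illustration only
recording = [float((i % 7) - 3) for i in range(6 * fs)]  # 6 s of synthetic samples
segments = segment(normalize(recording), fs)
print(len(segments), len(segments[0]))  # 2 5000
```

A 6-s recording yields two 2.5-s segments here, with the final second discarded; padding or overlapping windows would be equally plausible alternatives.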
- New
- Research Article
- 10.1007/s10278-026-01964-6
- Apr 22, 2026
- Journal of imaging informatics in medicine
- Tianhao Xiang + 1 more
Accurate differentiation between benign and malignant thyroid nodules remains challenging in clinical practice. Current deep learning approaches predominantly rely on single-modality analysis, failing to leverage complementary information from multiple clinical data sources. This study aims to develop and validate ThyroFusion, a multi-modal deep learning framework integrating ultrasound images, segmentation masks, and clinical text reports for improved thyroid nodule malignancy risk assessment. In this retrospective multi-center study, we developed ThyroFusion, a multi-modal fusion framework comprising: (1) a dual-stream ResNet-50 encoder with partially shared parameters for extracting features from ultrasound images and segmentation masks; (2) a Set Transformer module for aggregating variable numbers of image features; and (3) a bidirectional cross-modal attention mechanism for fusing visual and textual features extracted by frozen BioBERT. The framework was trained on 1472 cases from Xi'an International Medical Center Hospital and validated on four independent external test sets totaling 4530 cases from two clinical centers and two public datasets (DDTI and TN3K). Performance was compared against state-of-the-art deep learning models and radiologists with varying experience levels. ThyroFusion achieved an AUC of 0.937 (95% CI 0.914-0.960) on internal validation and 0.896 (95% CI 0.887-0.905) on combined external validation. Compared to single-modal approaches, ThyroFusion significantly outperformed ResNet-50 (AUC: 0.841), DenseNet-121 (AUC 0.848), EfficientNet-B4 (AUC 0.859), and Vision Transformer (AUC 0.835) on external validation (all p < 0.001). The model also outperformed senior radiologists (AUC 0.809) and demonstrated substantial improvement in junior radiologists' performance when used as an assistive tool (ΔAUC = 0.126). On public datasets, ThyroFusion achieved AUCs of 0.893 on DDTI and 0.881 on TN3K, demonstrating robust cross-domain generalization. 
ThyroFusion demonstrates robust performance in thyroid nodule malignancy risk assessment across multiple centers and public benchmarks, significantly outperforming state-of-the-art single-modal methods and experienced radiologists. The integration of visual and textual information through bidirectional cross-modal attention offers a promising tool for clinical decision support.
- New
- Research Article
- 10.1038/s41598-026-50189-8
- Apr 22, 2026
- Scientific reports
- Lin Wang + 3 more
Multimodal deep learning for anomaly detection in urban infrastructure networks: improving the resilience of public management systems.
- New
- Research Article
- 10.1080/17538947.2026.2662059
- Apr 21, 2026
- International Journal of Digital Earth
- Sining Duan + 5 more
With increasing demand for natural gas, the construction of natural gas extraction-related facilities has increased significantly. Accurate identification of these facilities is crucial for guiding spatial planning and evaluating environmental impacts. Existing research has primarily concentrated on offshore facilities, with limited attention to onshore facilities. This scarcity stems from identification challenges due to their dispersed distribution and complex environments. To address this gap, this study proposes a method combining a multimodal convolutional neural network (CNN) with object-based segmentation for onshore facility extraction. Experiments were conducted in northern Sichuan, China, using high-resolution imagery from the Chinese GF-2 satellite. Performance was compared against machine learning and CNN baselines that use sequentially cropped imagery. The proposed method achieved a precision of 59.97%, a recall of 94.87%, and an F1-score of 73.49%. The high recall indicates that most facilities were successfully detected, and the F1-score reflects the overall performance. These results suggest that the proposed method can effectively extract onshore facilities. Compared with the machine learning and CNN baselines using sequentially cropped imagery, the F1-score of the proposed method increased by 20.16% and 51.49%, respectively. The experimental results reveal that the proposed method can accurately identify onshore facilities, offering a scientific basis for assessing the environmental impact of greenhouse gases.
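The relationship between the three reported metrics can be checked directly, since F1 is the harmonic mean of precision and recall. The short sketch below (the function name `f1_score` is ours) plugs in the precision and recall quoted in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract above.
f1 = f1_score(0.5997, 0.9487)
print(f"F1 = {f1:.2%}")  # F1 = 73.49%
```

The result matches the reported 73.49%, and shows how a high recall combined with a modest precision pulls the F1-score toward the lower of the two values.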
- New
- Research Article
- 10.3389/fneur.2026.1791696
- Apr 21, 2026
- Frontiers in Neurology
- Chaojun Chen + 4 more
Background and objectives: Identifying multiple sclerosis (MS) in children early is critical, as early therapeutic intervention can improve outcomes. The anterior visual pathway has been demonstrated to be of central importance in diagnostic considerations for MS and has recently been identified as a fifth topography in the McDonald Diagnostic Criteria for MS. Optical coherence tomography (OCT) provides high-resolution retinal imaging and reflects the structural integrity of the retinal nerve fiber and ganglion cell inner plexiform layers. Whether multimodal deep learning models can use OCT alone to diagnose pediatric onset MS (POMS) is unknown. Methods: We analyzed 3D OCT scans collected prospectively through the Neuroinflammatory Registry of the Hospital for Sick Children (REB#1000005356). Raw macular and optic nerve head images, and 52 automatically segmented features were included. We evaluated three classification approaches: (1) deep learning models (e.g., ResNet, DenseNet) for representation learning followed by classical ML classifiers, (2) ML models trained on OCT-derived features, and (3) multimodal models combining both via early and late fusion. Results: Scans from individuals with POMS (onset 16.0 ± 3.1 years, 51.0% female; 211 scans) and 29 children with non-inflammatory neurological conditions (13.1 ± 4.0 years, 69.0% female, 52 scans) were included. The early fusion model achieved the highest performance (AUC: 0.90, weighted F1: 0.87, macro F1: 0.77, accuracy: 87%), outperforming both unimodal and late fusion models. The best unimodal feature-based model (SVC) yielded an AUC of 0.84, weighted F1 of 0.85, macro F1 of 0.73, and accuracy of 85%, while the best image-based model (ResNet101 with SVC) achieved an AUC of 0.79, weighted F1 of 0.84, macro F1 of 0.70, and accuracy of 87%. Late fusion underperformed, reaching 82% accuracy but failing in the minority class.
Discussion: Multimodal learning with early fusion significantly enhances diagnostic performance by combining spatial retinal information with clinically relevant structural features. This approach captures complementary patterns associated with MS pathology and shows promise as an AI-driven tool to support pediatric neuroinflammatory diagnosis.
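The difference between the early and late fusion strategies compared in this abstract can be sketched in a few lines. This is a generic illustration, not the study's pipeline: the function names, the example feature values, and the weighted-average rule for late fusion are assumptions made for the example.

```python
from typing import List, Optional, Sequence


def early_fusion(image_embedding: Sequence[float],
                 segmented_features: Sequence[float]) -> List[float]:
    """Early fusion: concatenate both modalities into one input vector
    that a single downstream classifier is trained on."""
    return list(image_embedding) + list(segmented_features)


def late_fusion(probabilities: Sequence[float],
                weights: Optional[Sequence[float]] = None) -> float:
    """Late fusion: combine per-model positive-class probabilities
    (here with a weighted average, one of several possible rules)."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    return sum(p * w for p, w in zip(probabilities, weights)) / sum(weights)


# Hypothetical values: a 3-dimensional image embedding and 2 segmented features.
fused_vector = early_fusion([0.12, -0.40, 0.88], [101.5, 78.2])
# Hypothetical outputs of an image-only model and a feature-only model.
fused_probability = late_fusion([0.8, 0.6])
print(len(fused_vector), round(fused_probability, 2))  # 5 0.7
```

Early fusion lets one classifier learn interactions between modalities, whereas late fusion only combines decisions that were made independently.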
- New
- Research Article
- 10.1007/s42452-026-08671-5
- Apr 20, 2026
- Discover Applied Sciences
- Xia Hou
Early warning and intervention for college students’ mental health status based on multimodal deep learning
- New
- Research Article
- 10.17148/ijarcce.2026.154124
- Apr 19, 2026
- IJARCCE
- Indu P K + 2 more
Availability-Aware Multimodal Deep Learning for Breast Cancer Diagnosis with Missing Modalities
- New
- Research Article
- 10.1080/10589759.2026.2651926
- Apr 16, 2026
- Nondestructive Testing and Evaluation
- Zhidong Dai
Additively manufactured biomechanical models are increasingly used in sports training simulations to evaluate athletes in terms of loading, injury risk, and performance optimisation. Their structural maintenance without destructive testing, however, is a major challenge, especially when defects arise from complicated interactions between materials and processes. This research introduces a fully non-invasive, deep-learning-based framework that evaluates printed biomechanical structures using multimodal sensing and feature fusion. A multi-stage adaptive sensor fusion denoiser (MASFD) enhances deformation maps, displacement fields, strain contours, vibration spectra, bone-weighted dynamic fields and thermal gradients by removing modality-specific noise while preserving defect-sensitive features. The refined data are processed by a Hybrid Convolutional-Transformer Neural Network (HCTN-Net) integrating modality-specific CNN encoders, cross-modal attention fusion and dual-task prediction heads for defect classification and mechanical property regression. Experimental results demonstrate superior defect detection accuracy (96.8%), high F1-score (96.6%) and significantly reduced regression error (RMSE = 0.072, R² = 0.97) compared to baseline models. The proposed system enables reliable, contact-free, real-time structural assessment, supporting rapid design feedback and enhanced safety of sports training components.
- New
- Research Article
- 10.1093/jncics/pkag039
- Apr 15, 2026
- JNCI cancer spectrum
- Peiying Hua + 4 more
Accurate survival prediction for grade 2/3 glioma patients remains challenging due to tumor biological heterogeneity and limitations of current prognostic methods that rely on single-modality data. We developed a multimodal deep learning framework integrating histopathology whole-slide images, somatic mutations, and clinical-demographic data. A three-stage training pipeline combined contrastive learning with survival-specific optimization to align cross-modal representations. The framework was trained on 498 grade 2/3 glioma patients from TCGA and evaluated using 5-fold cross-validation and an independent Dartmouth-Hitchcock Medical Center (DHMC) cohort (n = 61). The contrastive multimodal model achieved a c-index of 0.91 (95% CI: 0.84 to 0.96), significantly outperforming the unimodal models (image-only = 0.76; non-image-only = 0.87) and showing an improvement over the non-contrastive multimodal model (c-index = 0.89), although this difference was not statistically significant. Kaplan-Meier analysis demonstrated clear survival separation across risk strata (log-rank P = 4.4 × 10⁻⁵). Contrastive learning improved representation clustering quality, with silhouette scores increasing from 0.20 to 0.24 (P = 0.05). External evaluation on the DHMC cohort achieved a c-index of 0.87 (95% CI: 0.77 to 0.95) after domain adaptation. Contrastive multimodal learning significantly enhances survival prediction in grade 2/3 gliomas by effectively integrating histopathology, genomics, and clinical data. This annotation-free approach enables early risk stratification using routinely collected data and shows promise for informing personalized treatment decisions and clinical trial stratification.
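The c-index reported in this abstract measures how often a model ranks patient risk in the same order as observed survival. A minimal sketch of Harrell's concordance index follows; the function name and the toy data are ours, and the abstract does not say which c-index variant or tie handling the authors used.

```python
from typing import Sequence


def concordance_index(times: Sequence[float],
                      events: Sequence[int],
                      risks: Sequence[float]) -> float:
    """Harrell's c-index: fraction of comparable pairs in which the patient
    with the shorter observed survival has the higher predicted risk.
    A pair (i, j) is comparable when i had an event and times[i] < times[j].
    Tied risk scores count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: survival times, event indicators (0 = censored), predicted risks.
c = concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.4, 0.5, 0.1])
print(c)  # 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported 0.91 and 0.87 in context.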
- New
- Research Article
- 10.1021/acsami.6c01457
- Apr 15, 2026
- ACS applied materials & interfaces
- Yusen Guo + 6 more
Achieving robust human-machine interaction in noisy, constrained, or speech-impaired environments remains a significant challenge for conventional voice-based systems. Here, we present a wearable, flexible, and multichannel piezoresistive interface capable of decoding laryngeal and submandibular motion during complex speech behaviors. The system integrates a micropyramid polydimethylsiloxane (PDMS) sensing layer coated with conductive polypyrrole (PPy) onto a multichannel electrode array supported by a flexible polyimide (PI) substrate, providing superior skin conformity, high strain sensitivity, and robust long-term stability. We developed a fully integrated hardware platform enabling four-channel synchronous data acquisition, wireless transmission, and real-time on-device processing. A modified Audio Spectrogram Transformer (AST) combined with a multichannel fusion mechanism enables end-to-end semantic recognition. Using a 14-word core English vocabulary, we constructed two structured datasets, Microphone and Vocal, comprising a total of 3,840 samples. The system achieved classification accuracies of 99.6% and 96.4%, respectively, highlighting strong generalizability, semantic clarity, and robustness against signal variability. Real-world evaluations confirm stable performance under motion, facial expressions, and background noise. By unifying soft materials engineering, flexible circuit integration, and multimodal deep learning, this work advances speech recognition in complex environments and offers a scalable solution for assistive communication, wearable AI, and silent interaction under extreme conditions.
- New
- Research Article
- 10.1016/j.ejso.2026.111796
- Apr 15, 2026
- European journal of surgical oncology : the journal of the European Society of Surgical Oncology and the British Association of Surgical Oncology
- Kaiting Han + 6 more
MRI-driven multimodal deep learning approach for predicting pathological complete response after neoadjuvant chemoradiotherapy in locally advanced rectal cancer: A multicenter study.
- New
- Research Article
- 10.1016/j.watres.2026.125499
- Apr 15, 2026
- Water research
- Fulin Shao + 5 more
Screening toxic transformation products of emerging pollutants in advanced oxidation processes with 3D deep learning and in vitro assays.