Advancements in Medical Radiology Through Multimodal Machine Learning: A Comprehensive Overview.
Most of the data collected from various sources over a patient's lifetime can be assumed to contain information pertinent to delivering the best possible treatment. Medical data, such as radiographic and histopathology images, electrocardiograms, and medical records, all guide a physician's diagnostic approach. Nevertheless, most machine learning techniques in the healthcare field emphasize data analysis from a single modality, which is insufficiently reliable. This is especially evident in radiology, which has long been an essential topic of machine learning in healthcare because of its high data density, availability, and interpretability. In the future, computer-assisted diagnostic systems must be intelligent enough to process a variety of data simultaneously, similar to how doctors examine various resources while diagnosing patients. By extracting novel characteristics from diverse medical data sources, advanced identification techniques known as multimodal learning may be applied, enabling algorithms to analyze data from various sources and eliminating the need to train a separate model for each modality. This approach enhances the flexibility of algorithms by incorporating diverse data. A growing body of recent research has focused on extracting data from multiple sources and constructing precise multimodal machine/deep learning models for medical examinations. A comprehensive analysis and synthesis of recent publications focusing on multimodal machine learning in detecting diseases is provided. Potential future research directions are also identified. This review presents an overview of multimodal machine learning (MMML) in radiology, a field at the cutting edge of integrating artificial intelligence into medical imaging. As radiological practices continue to evolve, the combination of various imaging and non-imaging data modalities is gaining increasing significance.
This paper analyzes current methodologies, applications, and trends in MMML while outlining challenges and predicting upcoming research directions. Beginning with an overview of the different data modalities involved in radiology, namely imaging, text, and structured medical data, this review explains the processes of modality fusion, representation learning, and modality translation, showing how they boost diagnostic efficacy and improve patient care. Additionally, this review discusses key datasets that have been instrumental in advancing MMML research. This review may help clinicians and researchers understand the landscape of the field, outline the current level of advancement, and identify areas of research that need to be explored regarding MMML in radiology.
- Research Article
18
- 10.1109/jbhi.2025.3530156
- Jun 1, 2025
- IEEE journal of biomedical and health informatics
The application of machine learning in medicine and healthcare has led to the creation of numerous diagnostic and prognostic models. However, despite their success, current approaches generally issue predictions using data from a single modality. This stands in stark contrast with clinician decision-making which employs diverse information from multiple sources. While several multimodal machine learning approaches exist, significant challenges in developing multimodal systems remain that are hindering clinical adoption. In this paper, we introduce a multimodal framework, AutoPrognosis-M, that enables the integration of structured clinical (tabular) data and medical imaging using automated machine learning. AutoPrognosis-M incorporates 17 imaging models, including convolutional neural networks and vision transformers, and three distinct multimodal fusion strategies. In an illustrative application using a multimodal skin lesion dataset, we highlight the importance of multimodal machine learning and the power of combining multiple fusion strategies using ensemble learning. We have open-sourced our framework as a tool for the community and hope it will accelerate the uptake of multimodal machine learning in healthcare and spur further innovation.
- Research Article
31
- 10.3934/mbe.2023382
- Jan 1, 2023
- Mathematical biosciences and engineering : MBE
Nowadays, the increasing amount of medical diagnostic data and clinical data provides more complementary references for doctors when diagnosing patients. For example, with medical data such as electrocardiography (ECG), machine learning algorithms can be used to identify and diagnose heart disease and so reduce the workload of doctors. However, ECG data is in practice always exposed to various kinds of noise and interference, and medical diagnostics based only on one-dimensional ECG data is not sufficiently trustworthy. By extracting new features from other types of medical data, we can implement enhanced recognition methods, called multimodal learning. Multimodal learning helps models process data from a range of different sources, eliminates the requirement to train each learning modality separately, and improves the robustness of models through the diversity of data. A growing number of articles in recent years have been devoted to investigating how to extract data from different sources and build accurate multimodal machine learning or deep learning models for medical diagnostics. This paper reviews and summarizes several recent papers dealing with multimodal machine learning in disease detection, and identifies topics for future research.
- Research Article
24
- 10.1109/jbhi.2024.3448238
- Nov 1, 2024
- IEEE journal of biomedical and health informatics
Stroke is a life-threatening medical condition that could lead to mortality or significant sensorimotor deficits. Various machine learning techniques have been successfully used to detect and predict stroke-related outcomes. Considering the diversity in the type of clinical modalities involved during management of patients with stroke, such as medical images, bio-signals, and clinical data, multimodal machine learning has become increasingly popular. Thus, we conducted a systematic literature review to understand the current status of state-of-the-art multimodal machine learning methods for stroke prognosis and diagnosis. Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines during literature search and selection, our results show that the most dominant techniques are related to the fusion paradigm, specifically early, joint and late fusion. We discuss opportunities to leverage other multimodal learning paradigms, such as multimodal translation and alignment, which are generally less explored. We also discuss the scale of datasets and types of modalities used to develop existing models, highlighting opportunities for the creation of more diverse multimodal datasets. Finally, we present ongoing challenges and provide a set of recommendations to drive the next generation of multimodal learning methods for improved prognosis and diagnosis of patients with stroke.
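The early and late fusion paradigms named in the abstract above can be illustrated with a minimal sketch. It is not taken from any of the reviewed studies: the function names, feature values, and probabilities are hypothetical, and joint fusion is omitted because it requires jointly trained intermediate representations.

```python
from typing import List, Optional, Sequence


def early_fusion(features_by_modality: Sequence[Sequence[float]]) -> List[float]:
    """Early fusion: concatenate per-modality feature vectors into one input."""
    fused: List[float] = []
    for features in features_by_modality:
        fused.extend(features)
    return fused


def late_fusion(probabilities: Sequence[float],
                weights: Optional[Sequence[float]] = None) -> float:
    """Late fusion: weighted average of per-modality predicted probabilities."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    total_weight = sum(weights)
    return sum(p * w for p, w in zip(probabilities, weights)) / total_weight


# Hypothetical inputs: two image-derived features and three clinical variables.
image_features = [0.2, 0.7]
clinical_features = [63.0, 1.0, 0.0]

# Early fusion yields a single vector that one downstream model would consume.
fused_input = early_fusion([image_features, clinical_features])

# Late fusion combines the outputs of two separately trained models
# (0.8 and 0.6 are made-up probabilities, not results from any study).
fused_probability = late_fusion([0.8, 0.6])
```

In this sketch early fusion defers all learning to one model over the concatenated vector, whereas late fusion keeps the per-modality models independent and only merges their outputs.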
- Research Article
28
- 10.1109/tpami.2024.3420239
- Dec 1, 2024
- IEEE transactions on pattern analysis and machine intelligence
We are perceiving and communicating with the world in a multisensory manner, where different information sources are sophisticatedly processed and interpreted by separate parts of the human brain to constitute a complex, yet harmonious and unified sensing system. To endow the machines with true intelligence, multimodal machine learning that incorporates data from various sources has become an increasingly popular research area with emerging technical advances in recent years. In this paper, we present a survey on multimodal machine learning from a novel perspective considering not only the purely technical aspects but also the intrinsic nature of different data modalities. We analyze the commonness and uniqueness of each data format mainly ranging from vision, audio, text, and motions, and then present the methodological advancements categorized by the combination of data modalities, such as Vision+Text, with slightly inclined emphasis on the visual data. We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels, and provide an additional comparison in the light of their technical connections with the data nature, e.g., the semantic consistency between image objects and textual descriptions, and the rhythm correspondence between video dance moves and musical beats. We hope that the exploitation of the alignment as well as the existing gap between the intrinsic nature of data modality and the technical designs, will benefit future research studies to better address a specific challenge related to the concrete multimodal task, prompting a unified multimodal machine learning framework closer to a real human intelligence system.
- Research Article
4
- 10.1016/j.jdent.2023.104588
- Jun 21, 2023
- Journal of Dentistry
Multi-modal deep learning for automated assembly of periapical radiographs
- Research Article
3
- 10.1016/j.compmedimag.2025.102526
- Jul 1, 2025
- Computerized medical imaging and graphics : the official journal of the Computerized Medical Imaging Society
In Pancreatic Ductal Adenocarcinoma (PDAC), predicting genetic mutations directly from histopathological images using Deep Learning can provide valuable insights. The combination of several omics can provide further knowledge on mechanisms underlying tumor biology. This study aimed to develop an explainable multimodal pipeline to predict genetic mutations for the KRAS, TP53, SMAD4, and CDKN2A genes, integrating pathomic features with transcriptomics from two independent datasets: TCGA-PAAD, used as the training set, and CPTAC-PDA, used as the external validation set. Large and small configurations of CLAM (Clustering-constrained Attention Multiple Instance Learning) models were evaluated with three different feature extractors (ResNet50, UNI, and CONCH). RNA-seq data were pre-processed both conventionally and using three autoencoder architectures. The processed transcript panels were input into machine learning (ML) models for mutation classification. Attention maps and SHAP were employed, highlighting significant features from both data modalities. A fusion layer or a voting mechanism combined the outputs from pathomic and transcriptomic models, obtaining a multimodal prediction. Performance was compared using the areas under the Receiver Operating Characteristic (AUROC) and Precision-Recall (AUPRC) curves. On the validation set, for KRAS, multimodal ML achieved an AUROC of 0.92 and an AUPRC of 0.98. For TP53, the multimodal voting model achieved an AUROC of 0.75 and an AUPRC of 0.85. For SMAD4 and CDKN2A, transcriptomic ML models achieved AUROCs of 0.71 and 0.65, while multimodal ML showed AUPRCs of 0.39 and 0.37, respectively. This approach demonstrated the potential of combining pathomics with transcriptomics, offering an interpretable framework for predicting key genetic mutations in PDAC.
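As a rough illustration of the voting mechanism and the AUROC metric mentioned in the abstract above, the following sketch averages the probabilities of two unimodal models (soft voting) and scores each with a pairwise AUROC. The function names and all numbers are hypothetical toy values, not data or results from the study, and the study's actual voting rule may differ.

```python
from typing import List, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (the Mann-Whitney U formulation).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


def soft_vote(*model_scores: Sequence[float]) -> List[float]:
    """Average the predicted probabilities of several models per sample."""
    return [sum(sample) / len(sample) for sample in zip(*model_scores)]


# Toy mutation labels and made-up per-model probabilities for four samples.
labels = [1, 0, 1, 0]
pathomic_scores = [0.9, 0.6, 0.4, 0.2]
transcriptomic_scores = [0.6, 0.3, 0.8, 0.7]

voted_scores = soft_vote(pathomic_scores, transcriptomic_scores)

pathomic_auroc = auroc(labels, pathomic_scores)
transcriptomic_auroc = auroc(labels, transcriptomic_scores)
voted_auroc = auroc(labels, voted_scores)
```

On this toy data each unimodal model misranks one pair while the averaged scores rank every pair correctly, which is the kind of complementarity a voting scheme tries to exploit; real data need not behave this way.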
- Research Article
6
- 10.13374/j.issn2095-9389.2019.03.21.003
- May 1, 2020
- SHILAP Revista de lepidopterología
“Big data” is always collected from different resources that have different data structures. With the rapid development of information technologies, today's valuable data resources are characteristically multimodal. As a result, building on classical machine learning strategies, multi-modal learning has become a valuable research topic, enabling computers to process and understand “big data”. The cognitive processes of humans involve perception through different sense organs. Signals from the eyes, ears, nose, and hands (tactile sense) constitute a person’s understanding of a particular scene or the world as a whole. It is reasonable to believe that multi-modal methods with a greater ability to process complex heterogeneous data can further promote the progress of information technologies. The concepts of multimodality stemmed from psychology and pedagogy hundreds of years ago and have been popular in computer science during the past decade. In contrast to the concept of “media”, a “mode” is a more fine-grained concept that is associated with a typical data source or data form. The effective utilization of multi-modal data can help a computer understand a specific environment in a more holistic way. In this context, we first introduced the definition and main tasks of multi-modal learning. Based on this information, the mechanism and origin of multi-modal machine learning were then briefly introduced. Subsequently, statistical learning methods and deep learning methods for multi-modal tasks were comprehensively summarized. We also introduced the main styles of data fusion in multi-modal perception tasks, including feature representation, shared mapping, and co-training. Additionally, novel adversarial learning strategies for cross-modal matching or generation were reviewed. The main methods for multi-modal learning were outlined in this paper with a focus on future research issues in this field.
- Research Article
28
- 10.18034/ajhal.v4i2.658
- Dec 31, 2017
- Asian Journal of Humanity, Art and Literature
A modality is an event or experience. Life is multimodal: we see, hear, smell, feel, and taste. Multimodal experiences involve several of these modalities. Artificial intelligence must grasp multimodal views to understand our surroundings. Multimodal machine learning models interact with and correlate input from several modalities. It is a multi-disciplinary field with great potential. In this study, we analyze emerging multimodal machine learning technologies and categorize them scientifically rather than focusing on specific multimodal applications. Multimodal machine learning offers more potential and problems than classifications. Most multimodal learning research collects quantitative data from polls and surveys. This research reviews a detailed library of observational studies on multimodal data (MMD) skills for human learning using artificial intelligence-powered approaches, including Machine Learning and Deep Learning. This research also describes how MMD has improved learning and in what environments. This paper discusses multimodal learning, its ongoing improvements, and approaches to improving learning. Finally, future researchers should carefully consider building a system that aligns multimodal aspects with the study and learning plan. These elements could enhance multimodal learning by facilitating theory and practice activities. This research lays the groundwork for multimodal data use in future learning technologies and development.
- Research Article
1
- 10.2174/0118722121291771240216044918
- Feb 1, 2025
- Recent Patents on Engineering
Artificial intelligence (AI) has made its own place in the present world. AI is being used for betterment and advancement in almost every field. Machine learning (ML) is a part of AI and is currently applied extensively in various fields of science and technology, including healthcare. ML is a technique that uses AI to analyze, interpret, and make decisions. The objective is to summarize the applications of ML in various healthcare systems in order to understand the strengths and loopholes of the use of ML in medical science. The mechanisms and methods of the ML approach to various medical issues have been analyzed and discussed. ML techniques are being used to make decisions in medical cases, to determine the treatment regimen of a particular patient, to design and develop drugs, in personalized medicine, to design and select diagnoses for a particular disease, and to track a patient's recovery automatically. ML techniques use available clinical data and history to compare, classify, select, and execute results for any assigned task. In a nutshell, ML uses previously available information and data about the disease, the treatment protocols followed, and the results in correspondence with the clinical symptoms and pathological findings. Several achievements using ML in the healthcare system have yielded significant novel results that have been patented. There have been several thousand patents in the field of application of ML in healthcare systems from 2012 to 2023. Although ML in healthcare comes with some risks and unknown possibilities, its restricted and monitored application may speed up the healthcare system, save time, help to make efficient decisions in non-invasive ways, and open up new possibilities in the healthcare system.
- Research Article
- 10.55041/ijsrem52491
- Sep 9, 2025
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
Abstract— Data-driven healthcare offers unprecedented opportunities to enhance diagnosis, prognosis, and customized treatment by means of multi-modal learning. The present paper discusses the combined use of electronic health records (EHR), medical images, and genomic data through multi-modal deep learning. By drawing on heterogeneous data sources and combining their complementary strengths, multi-modal models are able to capture richer feature representations and more complex patterns that are not visible with unimodal processing. We propose an end-to-end protocol to align, preprocess, and fuse modalities, and demonstrate an application in which deep neural networks learn jointly from structured EHR data, high-dimensional imaging attributes, and gene expression data. Experiments show that the proposed model performs better on disease classification and patient stratification than its single-modality counterparts. The paper highlights the need for data alignment, imputation of missing modalities, and modality-specific representation learning in order to fully exploit multi-modal learning in the clinical context. Keywords— Multi-modal Learning, Electronic Health Records (EHR), Medical Imaging, Genomic Data, Deep Learning, Data Fusion, Healthcare AI, Precision Medicine, Patient Stratification, Biomedical Informatics.
- Conference Article
10
- 10.1145/3580305.3599208
- Aug 4, 2023
The recent advancements in machine learning and artificial intelligence (particularly foundation models such as BERT, GPT-3, T5, ResNet, etc.) have demonstrated remarkable capabilities and driven significant revolutionary changes to the way we make inferences from complex data. These models represent a fundamental shift in the way data are approached and offer exciting new research directions and opportunities for multimodal learning and data fusion. Given the potential of foundation models to transform the field of multimodal learning, there is a need to bring together experts and researchers to discuss the latest developments in this area, exchange ideas, and identify key research questions and challenges that need to be addressed. By hosting this workshop, we aim to create a forum for researchers to share their insights and expertise on multimodal data fusion and learning using foundation models, and to explore potential new research directions and applications in the rapidly evolving field. We expect contributions from interdisciplinary researchers to study and model interactions between (but not limited to) modalities of language, graphs, time-series, vision, tabular data, sensors, and more. Our workshop will emphasize interdisciplinary work and aim at seeding cross-team collaborations around new tasks, datasets, and models.
- Research Article
83
- 10.1109/access.2023.3243854
- Jan 1, 2023
- IEEE Access
Multimodal machine learning (MML) is an appealing multidisciplinary research area in which heterogeneous data from multiple modalities and machine learning (ML) are combined to solve critical problems. Research works usually use data from a single modality, such as images, audio, text, or signals. However, real-world problems have become increasingly critical, and handling them with multiple data modalities instead of a single one can significantly help in finding solutions. ML algorithms play an essential role by tuning parameters in developing MML models. This paper reviews recent advancements in the challenges of MML, namely representation, translation, alignment, fusion, and co-learning, and presents the gaps and challenges. A systematic literature review (SLR) was applied to define the progress and trends on those challenges in the MML domain. In total, 1032 articles were examined in this review to extract features such as source, domain, application, and modality. This research article will help researchers understand the current state of MML and navigate the selection of future research directions.
- Research Article
3
- 10.1371/journal.pdig.0000755
- May 14, 2025
- PLOS digital health
Progression free survival (PFS) is a critical clinical outcome endpoint during cancer management and treatment evaluation. Yet, PFS is often missing from publicly available datasets due to the current subjective, expert, and time-intensive nature of generating PFS metrics. Given emerging research in multi-modal machine learning (ML), we explored the benefits and challenges associated with mining different electronic health record (EHR) data modalities and automating extraction of PFS metrics via ML algorithms. We analyzed EHR data from 92 pathology-proven GBM patients, obtaining 233 corticosteroid prescriptions, 2080 radiology reports, and 743 brain MRI scans. Three methods were developed to derive clinical PFS: 1) frequency analysis of corticosteroid prescriptions, 2) natural language processing (NLP) of reports, and 3) computer vision (CV) volumetric analysis of imaging. Outputs from these methods were compared to manually annotated clinical guideline PFS metrics. Employing data-driven methods, standalone progression rates were 63% (prescription), 78% (NLP), and 54% (CV), compared to the 99% progression rate from manually applied clinical guidelines using integrated data sources. The prescription method identified progression an average of 5.2 months later than the clinical standard, while the CV and NLP algorithms identified progression earlier by 2.6 and 6.9 months, respectively. While lesion growth is a clinical guideline progression indicator, only half of patients exhibited increasing contrast-enhancing tumor volumes during scan-based CV analysis. Our results indicate that data-driven algorithms can extract tumor progression outcomes from existing EHR data. However, ML methods are subject to varying availability bias, supporting contextual information, and pre-processing resource burdens that influence the extracted PFS endpoint distributions. Our scan-based CV results also suggest that the automation of clinical criteria may not align with human intuition. 
Our findings indicate a need for improved data source integration, validation, and revisiting of clinical criteria in parallel to multi-modal ML algorithm development.
- Research Article
2
- 10.36001/phmap.2023.v4i1.3783
- Sep 4, 2023
- PHM Society Asia-Pacific Conference
Prognostics and Health Management (PHM) is identified as an important lever for enhancing the development of predictive maintenance to ensure the reliability, availability, and safety of industrial systems. However, the efficiency of data-driven PHM approaches depends on the quality and quantity of data. Therefore, exploiting multiple data sources can provide more useful information than single-modal data. For instance, by incorporating multiple data sources, including condition monitoring data, images from cameras, and texts from maintenance technicians’ reports, multi-modal learning can provide a more comprehensive and accurate understanding of the system’s health. However, multi-modal deep learning is complex to understand. To address this complexity, it is crucial to incorporate explainable artificial intelligence techniques to provide clear and interpretable insights into how the model makes decisions. In this light, this paper proposes the application of a model-agnostic explanation approach, namely SHAP, to explain the working mechanism of multimodal learning for the prediction of industrial steam generator degradation. In particular, we determine the important features of each data modality and investigate how multimodal learning can overcome the issues of low-quality data from a single modality by using the additional information from other data modalities.
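SHAP attributes a prediction to input features using Shapley values. The sketch below computes exact Shapley values by enumerating feature orderings for a two-feature toy model. It does not use the `shap` library and is not the paper's pipeline: the model, the feature names, the all-zero baseline, and the simplification of replacing "absent" features with baseline values are all assumptions made for illustration.

```python
from itertools import permutations
from typing import Callable, List, Sequence


def shapley_values(model: Callable[[Sequence[float]], float],
                   instance: Sequence[float],
                   baseline: Sequence[float]) -> List[float]:
    """Exact Shapley values by averaging marginal contributions over orderings.

    Features not yet "switched on" keep their baseline value (an assumption;
    SHAP explainers approximate this quantity in more scalable ways).
    """
    n_features = len(instance)
    contributions = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for index in order:
            current[index] = instance[index]
            output = model(current)
            contributions[index] += output - previous_output
            previous_output = output
    return [c / len(orderings) for c in contributions]


def toy_degradation_model(x: Sequence[float]) -> float:
    """Hypothetical degradation score from a sensor feature and an image feature."""
    sensor, image_score = x
    return 2.0 * sensor + 3.0 * image_score + sensor * image_score


instance = [1.0, 2.0]   # hypothetical sensor reading and image-derived score
baseline = [0.0, 0.0]   # hypothetical reference input
phi = shapley_values(toy_degradation_model, instance, baseline)
```

The attributions sum to the difference between the model output at the instance and at the baseline (here 10.0), which is the additivity property that makes per-modality feature importances comparable. Exact enumeration scales factorially with the number of features, so it is only practical for very small examples like this one.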
- Research Article
36
- 10.1016/j.eswa.2023.121168
- Aug 12, 2023
- Expert Systems with Applications
A survey on multimodal bidirectional machine learning translation of image and natural language processing