Multi-Modal Learning Approaches Combining EHR, Imaging, and Genomic Data

Abstract: Data-driven healthcare offers unprecedented opportunities to improve diagnosis, prediction, and personalized treatment through multi-modal learning. This paper discusses the integration of electronic health records (EHR), medical images, and genomic data through multi-modal deep learning. By drawing on heterogeneous data sources and combining their complementary strengths, multi-modal models can capture richer feature representations and more complex patterns than unimodal processing reveals. We propose an end-to-end protocol to align, preprocess, and fuse modalities, and demonstrate deep neural networks that learn jointly from structured EHR data, high-dimensional imaging features, and gene expression data. Experiments show that the proposed model outperforms its single-modality counterparts on disease classification and patient stratification. The paper highlights the need for data alignment, imputation of missing modalities, and modality-specific representation learning in order to fully exploit multi-modal data in the clinical context. Keywords: Multi-modal Learning, Electronic Health Records (EHR), Medical Imaging, Genomic Data, Deep Learning, Data Fusion, Healthcare AI, Precision Medicine, Patient Stratification, Biomedical Informatics.
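The abstract names three steps (align, impute missing modalities, fuse) without giving details. The following is a minimal sketch of what such a step could look like, not the paper's protocol: the function `align_and_fuse`, the patient IDs, and the feature values are all hypothetical, and mean imputation with an availability mask is only one of several possible choices.

```python
from statistics import mean

def align_and_fuse(ehr, imaging, genomics):
    """Join three per-patient feature dicts on patient ID and concatenate them.

    Each argument maps patient_id -> list of floats (hypothetical, already
    preprocessed features). A missing modality is filled with the column-wise
    mean over the patients who have it, and flagged in a 0/1 availability mask.
    Assumes every modality has at least one patient.
    """
    modalities = [ehr, imaging, genomics]
    patient_ids = sorted(set().union(*modalities))
    # Column-wise mean of each modality, used as the imputation value.
    fill = [[mean(col) for col in zip(*m.values())] for m in modalities]
    fused = {}
    for pid in patient_ids:
        features, mask = [], []
        for m, default in zip(modalities, fill):
            features.extend(m.get(pid, default))
            mask.append(1 if pid in m else 0)
        fused[pid] = (features, mask)
    return fused

# Toy data (illustrative values only).
ehr = {"p1": [63.0, 1.0], "p2": [47.0, 0.0]}
imaging = {"p1": [0.2, 0.8], "p2": [0.6, 0.4]}
genomics = {"p1": [1.5]}  # p2 has no gene-expression profile
fused = align_and_fuse(ehr, imaging, genomics)
```

Passing the mask along with the features lets a downstream model distinguish an imputed value from a measured one.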

Similar Papers
  • Dissertation
  • 10.32657/10356/182346
Data efficient deep multimodal learning
  • Jan 1, 2025
  • Meng Shen

Multimodal learning, which enables neural networks to process and integrate information from various sensory modalities such as vision, language, and sound, has become increasingly important in applications ranging from affective computing and healthcare to advanced multimodal chatbots. Despite its potential, multimodal learning faces significant challenges, particularly in the area of data efficiency. The requirement for large, high-quality datasets from multiple modalities presents a substantial barrier, limiting the scalability and accessibility of large multimodal models. This dissertation addresses several key issues in data-efficient deep multimodal learning, focusing on imbalanced multimodal data selection, the cold-start problem in multimodal active learning, and the mitigation of hallucinations in large vision-language models. Firstly, we analyze the limitations of conventional active learning strategies, which tend to favor dominant modalities, leading to unbalanced multimodal models that neglect weaker modalities. To overcome this, we propose a gradient embedding modulation method that ensures a more equitable data selection process across modalities, resulting in models that fairly utilize both weak and strong modalities. Building on our work in warm-start active learning, we tackle the cold-start problem in multimodal active learning, where no initial labels are available for warm-start data selection. We develop a two-stage approach that first reduces the modality representation gap through multimodal self-supervised learning, utilizing unimodal prototypes to harmonize representations across modalities. In the subsequent data selection stage, we introduce a regularization term to maximize modality alignment, leading to improved model performance using the same amount of data compared to existing methods.
Extending our focus from data selection to the usage of training data, we address the challenge of hallucinations in large vision-language models, where the models generate content that is incorrect in the context of input images. We investigate the relationship between hallucinations and visual dependence of tokens, revealing that certain tokens contribute disproportionately to these hallucinations. Based on this insight, we propose an approach that adjusts training weights according to the visual dependence of tokens, thereby reducing the hallucination rate without requiring additional training data or inference costs. The contributions of this thesis offer significant advancements in the field of data-efficient multimodal learning. By developing novel methods for balancing multimodal data selection, addressing the cold-start problem in multimodal active learning, and mitigating hallucinations in large vision-language models, this work paves the way for more practical and scalable multimodal learning systems that require less data and computational effort while achieving superior performance.

  • Research Article
  • Citations: 4
  • 10.1016/j.jdent.2023.104588
Multi-modal deep learning for automated assembly of periapical radiographs
  • Jun 21, 2023
  • Journal of Dentistry
  • L Pfänder + 5 more

  • Research Article
  • Citations: 17
  • 10.3389/frai.2023.1247195
Multimodal deep learning for liver cancer applications: a scoping review.
  • Oct 27, 2023
  • Frontiers in artificial intelligence
  • Aisha Siam + 5 more

Hepatocellular carcinoma is a malignant neoplasm of the liver and a leading cause of cancer-related deaths worldwide. Multimodal data combine several modalities, such as medical images, clinical parameters, and electronic health record (EHR) reports, from diverse sources to support the diagnosis of liver cancer. The introduction of deep learning models with multimodal data can enhance the diagnosis and improve physicians' decision-making for cancer patients. This scoping review explores the use of multimodal deep learning techniques (i.e., combining medical images and EHR data) in the diagnosis and prognosis of hepatocellular carcinoma (HCC) and cholangiocarcinoma (CCA). A comprehensive literature search was conducted in six databases along with forward and backward reference-list checking of the included studies. The PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) extension for scoping review guidelines was followed for the study selection process. The data was extracted and synthesized from the included studies through thematic analysis. Ten studies were included in this review. These studies utilized multimodal deep learning to predict and diagnose hepatocellular carcinoma (HCC), but no studies examined cholangiocarcinoma (CCA). Four imaging modalities (CT, MRI, WSI, and DSA) and 51 unique EHR records (clinical parameters and biomarkers) were used in these studies. The most frequently used medical imaging modalities were CT scans followed by MRI, whereas the most common EHR parameters used were age, gender, alpha-fetoprotein (AFP), albumin, coagulation factors, and bilirubin. Ten unique deep-learning techniques were applied to both EHR modalities and imaging modalities for two main purposes, prediction and diagnosis. The use of multimodal data and deep learning techniques can help in the diagnosis and prediction of HCC.
However, there is a limited number of works and available datasets for liver cancer, thus limiting the overall advancements of AI for liver cancer applications. Hence, more research should be undertaken to explore further the potential of multimodal deep learning in liver cancer applications.

  • Abstract
  • 10.1016/j.hrthm.2023.03.296
CE-452775-2 MARS-HCM: MULTI-MODAL DEEP LEARNING METHOD FOR VENTRICULAR ARRHYTHMIA (VA) RISK STRATIFICATION IN HYPERTROPHIC CARDIOMYOPATHY (HCM) PATIENTS
  • May 1, 2023
  • Heart Rhythm
  • Changxin Lai + 10 more

  • Book Chapter
  • Citations: 3
  • 10.2174/9789815305128124010008
Multimodal Deep Learning in Medical Diagnostics: A Comprehensive Exploration of Cardiovascular Risk Prediction
  • Oct 10, 2024
  • Sonia Raj + 1 more

Machine learning algorithms have been important in identifying and predicting cardiovascular risk. These algorithms use a variety of data sources, including patient histories, clinical measures, and electronic health records, to identify people at risk of cardiovascular problems. Deep learning methods, a subset of machine learning, hold the promise of enhancing the accuracy and effectiveness of cardiovascular risk prediction models. In this research, retinal images, clinical data, and various clinical features are employed to harness the capabilities of multimodal deep learning for predicting cardiovascular risk. The integration of these modalities enables a holistic assessment of an individual's cardiovascular health, contributing to the advancement of precision medicine in the realm of Cardiovascular Disease (CVD). The impact of this research extends beyond cardiovascular risk prediction, as it exemplifies the transformative potential of machine learning in healthcare. By applying cutting-edge technology to medical challenges, our work addresses the urgent need for early risk assessment, patient stratification, and personalized interventions. This showcases how the synergy of different data types and deep learning can lead to improved clinical decision support, reduced healthcare costs, and, ultimately, enhanced patient outcomes. Deploying such multimodal deep learning models in clinical practice has the potential to revolutionize the field of cardiovascular health and set a precedent for the broader role of machine learning in healthcare.

  • Research Article
  • 10.64149/j.ver.8.18s.128-136
Multimodal Deep Learning System for Early Detection of Chronic Diseases using Medical Images + EHR Data
  • Jan 1, 2025
  • Vascular and Endovascular Review
  • Anusha Jain, Priyanka Dhasal, Sonal Modh Bhardwaj

Early detection of chronic diseases is essential for reducing long-term health complications and improving patient survival outcomes. Traditional diagnostic systems rely heavily on single-modality data, such as medical imaging or clinical records, which often fail to capture the multidimensional nature of chronic disease progression. This research presents a multimodal deep learning framework that integrates medical images with Electronic Health Records (EHR) to enhance early disease prediction. The proposed system utilizes a Convolutional Neural Network (CNN) for extracting structural and morphological patterns from imaging modalities such as MRI, CT, X-ray, and retinal fundus images. In parallel, an LSTM/Transformer-based encoder processes EHR variables, including laboratory values, comorbidities, vitals, and demographic information. The latent representations from both modalities are fused using an intermediate multimodal fusion strategy to generate a unified patient-level diagnostic prediction. Experimental results show that the proposed multimodal model significantly outperforms image-only and EHR-only models, achieving an overall accuracy of 92.8%, an F1-score of 91.0%, and an AUC of 0.96. Per-class analysis demonstrates substantial improvement in detecting early-stage conditions such as diabetic retinopathy, chronic kidney disease, cardiovascular diseases, and COPD. The inclusion of Grad-CAM and SHAP-based interpretability analyses further enhances the clinical reliability of the model. Overall, the findings confirm that integrating imaging and EHR data through multimodal deep learning provides a more comprehensive and accurate approach for early chronic disease detection and has strong potential for real clinical implementation.

  • Research Article
  • Citations: 1
  • 10.1158/1538-7445.am2024-2313
Abstract 2313: Multi-modal deep learning to predict cancer outcomes by integrating radiology and pathology images
  • Mar 22, 2024
  • Cancer Research
  • Zhe Li + 2 more

Purpose: Cancer patients routinely undergo radiologic and pathologic evaluation for their diagnostic workup. These data modalities represent a valuable and readily available resource for developing new prognostic tools. Given their vast difference in spatial scales, effective methods to integrate the two modalities are currently lacking. Here, we aim to develop a multi-modal approach to integrate radiology and pathology images for predicting outcomes in cancer patients. Methods: We propose a multi-modal weakly-supervised deep learning framework to integrate radiology and pathology images for survival prediction. We first extract multi-scale features from whole-slide H&E-stained pathology images to characterize cellular and tissue phenotypes as well as spatial cellular organization. We then build a hierarchical co-attention transformer to effectively learn the multi-modal interactions between radiology and pathology image features. Finally, a multimodal risk score is derived by combining complementary information from the two imaging modalities and clinical data for predicting outcome. We evaluate our approach in lung, gastric, and brain cancers with matched radiology and pathology images and clinical data available, each with separate training and external validation cohorts. Results: The multi-modal deep learning models achieved a reasonably high accuracy for predicting survival outcomes in the external validation cohorts (C-index range: 0.72-0.75 across three cancer types). The multi-modal prognostic models significantly improved upon single-modal approaches based on radiology or pathology images or clinical data alone (C-index range: 0.53-0.71, P<0.01). The multi-modal deep learning models were significantly associated with disease-free survival and overall survival (hazard ratio range: 3.23-4.46, P<0.0001).
In multivariable analyses, the models remained an independent prognostic factor (P<0.01) after adjusting for clinicopathological variables including cancer stage and tumor differentiation. Conclusions: The proposed multi-modal deep learning approach outperforms traditional methods for predicting survival outcomes by leveraging routinely available radiology and pathology images. With further independent validation, this may afford a promising approach to improve risk stratification and better inform treatment strategies for cancer patients. Citation Format: Zhe Li, Yuming Jiang, Ruijiang Li. Multi-modal deep learning to predict cancer outcomes by integrating radiology and pathology images [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 2313.
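The results above are reported as a concordance index (C-index). As a reading aid, here is a minimal sketch of Harrell's C-index for right-censored survival data. The function name and the toy numbers are illustrative only and are not taken from the study.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable patient pairs ranked correctly.

    times  -- observed follow-up times
    events -- 1 if the event (e.g. death) was observed, 0 if censored
    risks  -- model risk scores; a higher score should mean shorter survival
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort (illustrative values only).
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]
c_index = concordance_index(times, events, risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported ranges of 0.72-0.75 and 0.53-0.71 should be read.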

  • Book Chapter
  • Citations: 33
  • 10.1007/978-3-031-16443-9_60
Survival Prediction of Brain Cancer with Incomplete Radiology, Pathology, Genomic, and Demographic Data
  • Jan 1, 2022
  • Can Cui + 9 more

Integrating cross-department multi-modal data (e.g., radiology, pathology, genomic, and demographic data) is ubiquitous in brain cancer diagnosis and survival prediction. To date, such an integration is typically conducted by human physicians (and panels of experts), which can be subjective and semi-quantitative. Recent advances in multi-modal deep learning, however, have opened a door to leverage such a process in a more objective and quantitative manner. Unfortunately, the prior arts of using four modalities on brain cancer survival prediction are limited by a “complete modalities” setting (i.e., with all modalities available). Thus, there are still open questions on how to effectively predict brain cancer survival from incomplete radiology, pathology, genomic, and demographic data (e.g., one or more modalities might not be collected for a patient). For instance, should we use both complete and incomplete data, and more importantly, how do we use such data? To answer the preceding questions, we generalize the multi-modal learning on cross-department multi-modal data to a missing data setting. Our contribution is three-fold: 1) We introduce a multi-modal learning with missing data (MMD) pipeline with competitive performance and less hardware consumption; 2) We extend multi-modal learning on radiology, pathology, genomic, and demographic data into missing data scenarios; 3) A large-scale public dataset (with 962 patients) is collected to systematically evaluate glioma tumor survival prediction using four modalities. The proposed method improved the C-index of survival prediction from 0.7624 to 0.8053. Keywords: Multi-modal learning, Survival prediction, Missing modalities
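The abstract does not describe how the MMD pipeline handles absent modalities. One simple way to make a fusion step tolerate missing inputs is to average only the embeddings that are present. The sketch below shows that generic idea under stated assumptions: `fuse_available` is a hypothetical helper, not the authors' method, and it assumes all modality embeddings share the same length.

```python
def fuse_available(embeddings):
    """Average the embeddings of whichever modalities a patient actually has.

    embeddings -- dict mapping modality name -> list of floats, or None when
                  that modality was not collected. All vectors share one length.
    """
    present = [vec for vec in embeddings.values() if vec is not None]
    if not present:
        raise ValueError("patient has no available modality")
    # Element-wise mean over the available modality vectors only.
    return [sum(dim) / len(present) for dim in zip(*present)]

# Hypothetical patient with one missing modality.
patient = {
    "radiology": [0.2, 0.4],
    "pathology": None,  # slide not available
    "genomic": [0.6, 0.0],
    "demographic": [0.1, 0.2],
}
fused = fuse_available(patient)
```

Because the output length does not depend on how many modalities are present, the same downstream predictor can be applied to complete and incomplete patients alike.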

  • Research Article
  • Citations: 160
  • 10.1145/3545572
A Review on Methods and Applications in Multimodal Deep Learning
  • Feb 17, 2023
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Summaira Jabeen + 5 more

Deep Learning has implemented a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities. Despite the extensive development made for unimodal learning, it still cannot cover all the aspects of human learning. Multimodal learning helps to understand and analyze better when various senses are engaged in the processing of information. This article focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, physiological signals, flow, RGB, pose, depth, mesh, and point cloud. Detailed analysis of the baseline approaches and an in-depth study of recent advancements during the past five years (2017 to 2021) in multimodal deep learning applications has been provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth. Last, main issues are highlighted separately for each domain, along with their possible future research directions.

  • Research Article
  • Citations: 58
  • 10.1016/j.imavis.2025.105509
A systematic review of intermediate fusion in multimodal deep learning for biomedical applications
  • May 1, 2025
  • Image and Vision Computing
  • Valerio Guarrasi + 6 more

Deep learning has revolutionized biomedical research by providing sophisticated methods to handle complex, high-dimensional data. Multimodal deep learning (MDL) further enhances this capability by integrating diverse data types such as imaging, textual data, and genetic information, leading to more robust and accurate predictive models. In MDL, differently from early and late fusion methods, intermediate fusion stands out for its ability to effectively combine modality-specific features during the learning process. This systematic review comprehensively analyzes and formalizes current intermediate fusion methods in biomedical applications, highlighting their effectiveness in improving predictive performance and capturing complex inter-modal relationships. We investigate the techniques employed, the challenges faced, and potential future directions for advancing intermediate fusion methods. Additionally, we introduce a novel structured notation that standardizes intermediate fusion architectures, enhancing understanding and facilitating implementation across various domains. Our findings provide actionable insights and practical guidelines intended to support researchers, healthcare professionals, and the broader deep learning community in developing more sophisticated and insightful multimodal models. Through this review, we aim to provide a foundational framework for future research and practical applications in the dynamic field of MDL.
Highlights:
  • Comprehensive review of intermediate fusion in multimodal learning in biomedicine.
  • Structured notation for categorizing intermediate fusion methods.
  • Analysis of the benefits and challenges of intermediate fusion in biomedical contexts.
  • Identification of future research directions for improving current fusion techniques.
  • Versatile framework applicable to other multimodal deep learning domains.
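To make the distinction between early, intermediate, and late fusion concrete, here is a minimal sketch using single dense layers in plain Python. It does not use the review's structured notation. All function names, weights, and feature vectors are made up for illustration, and real systems would use trained deep encoders.

```python
import math

def linear(x, weights, bias):
    """One dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def early_fusion(img, ehr, head):
    # Concatenate the raw inputs, then apply one shared prediction head.
    return sigmoid(linear(img + ehr, *head)[0])

def intermediate_fusion(img, ehr, img_enc, ehr_enc, head):
    # Encode each modality separately, concatenate the latent features,
    # then predict from the joint representation.
    latent = linear(img, *img_enc) + linear(ehr, *ehr_enc)
    return sigmoid(linear(latent, *head)[0])

def late_fusion(img, ehr, img_head, ehr_head):
    # Each modality makes its own prediction; the decisions are averaged.
    p_img = sigmoid(linear(img, *img_head)[0])
    p_ehr = sigmoid(linear(ehr, *ehr_head)[0])
    return (p_img + p_ehr) / 2.0

# Toy feature vectors and hand-picked (untrained) weights, as (weights, bias).
img, ehr = [1.0, 0.0], [0.0, 1.0]
p_early = early_fusion(img, ehr, ([[1.0, 0.0, 0.0, -1.0]], [0.0]))
p_mid = intermediate_fusion(
    img, ehr,
    ([[1.0, 0.0]], [0.0]),   # image encoder
    ([[0.0, 1.0]], [0.0]),   # EHR encoder
    ([[2.0, 0.0]], [0.0]),   # joint head over the 2-d latent vector
)
p_late = late_fusion(img, ehr, ([[2.0, 0.0]], [0.0]), ([[0.0, -2.0]], [0.0]))
```

The three functions differ only in where the modalities meet: at the input, at a learned latent layer, or at the decision.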

  • Research Article
  • 10.2337/db25-1879-lb
1879-LB: Predicting Diabetic Peripheral Neuropathy in Type 2 Diabetes Using a Multimodal Model Integrating Foot Radiograph and Electronic Medical Records
  • Jun 20, 2025
  • Diabetes
  • Chae Won Chung + 6 more

Introduction and Objective: The global rise in diabetes has led to increased chronic complications, with diabetic peripheral neuropathy (DPN) being the most common. Undiagnosed DPN can progress to diabetic foot ulcers, making early screening at type 2 diabetes (T2D) diagnosis crucial. However, an optimal and cost-effective diagnostic strategy has yet to be established. This study aims to develop a multimodal deep learning model integrating foot radiographs and electronic medical records (EMRs) to improve DPN prediction. Methods: We utilized a small dataset consisting of 133 patients with 607 foot radiograph images and 133 EMRs from an internal dataset, and 29 patients with 111 foot radiograph images and 29 EMRs from an external dataset. To augment the data, we applied cropping, and utilized fine-tuning and multimodal learning to enhance performance. Our model was trained on foot radiograph images using ResNet50, while EMR data were concatenated with the image feature map at the classifier to build the multimodal model. Model performance was evaluated based on AUC, specificity, and accuracy. Results: Among the training cohort, 72 patients were classified as DPN (-) and 61 as DPN (+). The mean age was significantly lower in the DPN (+) group than in the DPN (-) group (59 ± 14 vs. 65 ± 12 years, P = 0.023), whereas HbA1c levels were comparable between the two groups (8.1 ± 2.2% vs. 7.8 ± 1.9%, P = 0.521). The proposed multimodal deep learning model achieved an AUC of 0.894 and an accuracy of 0.841 on the internal dataset, and an AUC of 0.723 and an accuracy of 0.730 on the external dataset. Notably, the multimodal approach outperformed single-input models, which exhibited lower AUC and accuracy in both internal and external testing. Conclusion: Despite a small dataset, our model showed strong predictive performance for DPN. In light of the lack of cost-effective diagnostic tools, this approach could serve as a valuable screening aid. 
Further studies with larger datasets are needed for broader clinical application. Disclosure C. Chung: None. Y. Jang: None. M. Kwon: None. J. Moon: None. K. Kim: None. G. Lee: None. J. Kim: None. Funding Korean Diabetes Association (2023F-7)
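The study evaluates its model by AUC, specificity, and accuracy. The sketch below shows one way these three metrics can be computed from labels and predicted scores. The labels, scores, and the 0.5 decision threshold are illustrative assumptions, not data from the study.

```python
def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def specificity_and_accuracy(labels, scores, threshold=0.5):
    """Threshold the scores, then return (specificity, accuracy).

    Assumes the labels contain at least one negative case.
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return tn / (tn + fp), correct / len(labels)

# Toy example: 1 = DPN (+), 0 = DPN (-); scores are made-up model outputs.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc_value = auc(labels, scores)
specificity, accuracy = specificity_and_accuracy(labels, scores)
```

AUC is threshold-free, whereas specificity and accuracy depend on the chosen cut-off, so the two kinds of metric can move in different directions between internal and external test sets.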

  • Research Article
  • Citations: 3
  • 10.1007/s10278-025-01566-8
Multimodal Deep Learning Based on Ultrasound Images and Clinical Data for Better Ovarian Cancer Diagnosis.
  • Jun 24, 2025
  • Journal of imaging informatics in medicine
  • Chang Su + 8 more

This study aimed to develop and validate a multimodal deep learning model that leverages 2D grayscale ultrasound (US) images alongside readily available clinical data to improve diagnostic performance for ovarian cancer (OC). A retrospective analysis was conducted involving 1899 patients who underwent preoperative US examinations and subsequent surgeries for adnexal masses between 2019 and 2024. A multimodal deep learning model was constructed for OC diagnosis and extracting US morphological features from the images. The model's performance was evaluated using metrics such as receiver operating characteristic (ROC) curves, accuracy, and F1 score. The multimodal deep learning model exhibited superior performance compared to the image-only model, achieving areas under the curves (AUCs) of 0.9393 (95% CI 0.9139-0.9648) and 0.9317 (95% CI 0.9062-0.9573) in the internal and external test sets, respectively. The model significantly improved the AUCs for OC diagnosis by radiologists and enhanced inter-reader agreement. Regarding US morphological feature extraction, the model demonstrated robust performance, attaining accuracies of 86.34% and 85.62% in the internal and external test sets, respectively. Multimodal deep learning has the potential to enhance the diagnostic accuracy and consistency of radiologists in identifying OC. The model's effective feature extraction from ultrasound images underscores the capability of multimodal deep learning to automate the generation of structured ultrasound reports.

  • Preprint Article
  • 10.5194/egusphere-egu23-5818
Application of multimodal deep learning using radar and water level data for water level prediction
  • May 15, 2023
  • Seongsim Yoon + 2 more

In general, water level prediction models using deep learning techniques have been developed using time-series water level observation data from upstream water level stations and target water level stations even though many of physical data are necessary to predict water level. The changes of the water level are greatly affected by rainfall in the basin, therefore rainfall information is needed to more accurately predict the water level. In particular, radar data has the advantage of being able to directly acquire the amount of rainfall occurring within a watershed. This study aims to develop the multimodal deep learning model to predict the water level using 2D grid radar rainfall data and 1D time-series water level observation data. This study proposed two multimodal deep learning models which have different structures. Both multimodal deep learning models predict the water level by simultaneously using the observed water level data up to the present time and the radar rainfall data that affects the water level in the future. The first proposed model consists of a deep learning network that links 2D Average Pooling (AvgPool2D), which compresses 2D radar data to 1D data, and Long Short-Term Memory (LSTM), which predicts 1D time series water level data. The second proposed model consists of a deep learning network that predicts water levels by linking Conv2DLSTM and LSTM, which can reflect the characteristics of 2D radar data without deformation.  The two proposed multimodal deep learning models were learned and evaluated in the upper basin of Hantan River. In addition, it was compared with the results of single-modal LSTM using only water level data. There are three water level stations in the study area, and the objective was to predict the water level of the downstream station up to 180 minutes in advance. For learning and verification of the deep learning model, 10-minute water level and radar rainfall data were collected from May 2019 to October 2021. 
For the radar data used as input, the grid cells within the target watershed were extracted from composite radar data with a resolution of 1 km operated by the Ministry of Environment. Evaluation of the trained models showed that both multimodal models had higher prediction accuracy than the single-modal model using only water level data. In particular, the second proposed model (Conv2DLSTM+LSTM) had better predictive performance than the first proposed model (AvgPool2D+LSTM) at the time of a sudden rise in water level due to rainfall. Acknowledgments: Research for this paper was carried out under the KICT Research Program (project no. 202200175-001, Development of future-leading technologies solving water crisis against to water disasters affected by climate change) funded by the Ministry of Science and ICT.
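The first model compresses each 2D radar frame by average pooling before feeding an LSTM. The sketch below covers only that preprocessing idea, under stated assumptions: the grid sizes, pooling window, and values are invented, the helper names are hypothetical, and the LSTM itself is omitted.

```python
def avg_pool_2d(grid, k):
    """Non-overlapping k x k average pooling over a 2D list (rows x cols)."""
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r in range(0, rows - k + 1, k):
        pooled_row = []
        for c in range(0, cols - k + 1, k):
            window = [grid[r + i][c + j] for i in range(k) for j in range(k)]
            pooled_row.append(sum(window) / (k * k))
        pooled.append(pooled_row)
    return pooled

def build_sequence(radar_frames, water_levels, k):
    """Flatten each pooled radar frame and append the water level at that step."""
    sequence = []
    for frame, level in zip(radar_frames, water_levels):
        pooled = avg_pool_2d(frame, k)
        flat = [v for row in pooled for v in row]
        sequence.append(flat + [level])
    return sequence

# Two toy time steps of 2 x 4 radar rainfall grids plus observed water levels.
frames = [
    [[0.0, 2.0, 4.0, 4.0],
     [2.0, 4.0, 4.0, 4.0]],
    [[1.0, 1.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]],
]
levels = [1.2, 1.5]
sequence = build_sequence(frames, levels, k=2)
```

Each row of `sequence` is one time step of a 1D multivariate series, which is the kind of input a recurrent model can consume. The trade-off is that pooling discards spatial detail, whereas the second model (Conv2DLSTM) is described as using the 2D radar data without deformation.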

  • Research Article
  • Citations: 10
  • 10.3390/bioengineering12050477
Advancements in Medical Radiology Through Multimodal Machine Learning: A Comprehensive Overview.
  • Apr 30, 2025
  • Bioengineering (Basel, Switzerland)
  • Imran Ul Haq + 5 more

The majority of data collected and obtained from various sources over a patient's lifetime can be assumed to comprise pertinent information for delivering the best possible treatment. Medical data, such as radiographic and histopathology images, electrocardiograms, and medical records, all guide a physician's diagnostic approach. Nevertheless, most machine learning techniques in the healthcare field emphasize data analysis from a single modality, which is insufficiently reliable. This is especially evident in radiology, which has long been an essential topic of machine learning in healthcare because of its high data density, availability, and interpretation capability. In the future, computer-assisted diagnostic systems must be intelligent to process a variety of data simultaneously, similar to how doctors examine various resources while diagnosing patients. By extracting novel characteristics from diverse medical data sources, advanced identification techniques known as multimodal learning may be applied, enabling algorithms to analyze data from various sources and eliminating the need to train each modality. This approach enhances the flexibility of algorithms by incorporating diverse data. A growing quantity of current research has focused on the exploration of extracting data from multiple sources and constructing precise multimodal machine/deep learning models for medical examinations. A comprehensive analysis and synthesis of recent publications focusing on multimodal machine learning in detecting diseases is provided. Potential future research directions are also identified. This review presents an overview of multimodal machine learning (MMML) in radiology, a field at the cutting edge of integrating artificial intelligence into medical imaging. As radiological practices continue to evolve, the combination of various imaging and non-imaging data modalities is gaining increasing significance. 
This paper analyzes current methodologies, applications, and trends in MMML while outlining challenges and predicting upcoming research directions. Beginning with an overview of the different data modalities involved in radiology, namely, imaging, text, and structured medical data, this review explains the processes of modality fusion, representation learning, and modality translation, showing how they boost diagnosis efficacy and improve patient care. Additionally, this review discusses key datasets that have been instrumental in advancing MMML research. This review may help clinicians and researchers comprehend the spatial distribution of the field, outline the current level of advancement, and identify areas of research that need to be explored regarding MMML in radiology.

  • Research Article
  • Citations: 1
  • 10.1051/0004-6361/202553751
Estimation of age and metallicity for galaxies based on multi-modal deep learning
  • Jun 1, 2025
  • Astronomy & Astrophysics
  • Ping Li + 4 more

Aims. This study is aimed at deriving the age and metallicity of galaxies by proposing a novel multi-modal deep learning framework. This multi-modal framework integrates spectral and photometric data, offering advantages in cases where spectra are incomplete or unavailable. Methods. We propose a multi-modal learning method for estimating the age and metallicity of galaxies (MMLforGalAM). This method uses two modalities: spectra and photometric images as training samples. Its architecture consists of four models: a spectral feature extraction model (ℳ1), a simulated spectral feature generation model (ℳ2), an image feature extraction model (ℳ3), and a multi-modal attention regression model (ℳ4). Specifically, ℳ1 extracts spectral features associated with age and metallicity from spectra observed by the Sloan Digital Sky Survey (SDSS). These features are then used as labels to train ℳ2, which generates simulated spectral features for photometric images to address the challenge of missing observed spectra for some images. Overall, ℳ1 and ℳ2 provide a transformation from photometric to spectral features, with the goal of constructing a spectral representation of data pairs (photometric and spectral features) for multi-modal learning. Once ℳ2 is trained, MMLforGalAM can then be applied to scenarios with only images, even in the absence of spectra. Then, ℳ3 processes SDSS photometric images to extract features related to age and metallicity. Finally, ℳ4 combines the simulated spectral features from ℳ2 with the extracted image features from ℳ3 to predict the age and metallicity of galaxies. Results. Trained on 36278 galaxies from SDSS, our model predicts the stellar age and metallicity, with a scatter of 1σ = 0.1506 dex for age and 1 σ = 0.1402 dex for metallicity. Compared to a single-modal model trained using only images, the multi-modal approach reduces the scatter by 27% for age and 15% for metallicity.
