Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models.

Abstract

Domain adaptation and generalization are crucial for real-world applications, such as autonomous driving and medical imaging, where the model must operate reliably across environments with distinct data distributions. However, these tasks are challenging because the model needs to overcome various domain gaps caused by variations in, for example, lighting, weather, and sensor configurations. Addressing domain gaps simultaneously in different modalities, known as multimodal domain adaptation and generalization, is even more challenging because each modality poses its own difficulties. Over the past few years, significant progress has been made in these areas, with applications ranging from action recognition to semantic segmentation. Recently, the emergence of large-scale pre-trained multimodal foundation models, such as CLIP, has inspired numerous studies that leverage these models to enhance downstream adaptation and generalization. This survey summarizes recent advances in multimodal adaptation and generalization, particularly how these areas have evolved from traditional approaches to foundation models. Specifically, this survey covers (1) multimodal domain adaptation, (2) multimodal test-time adaptation, (3) multimodal domain generalization, (4) domain adaptation and generalization with the help of multimodal foundation models, and (5) adaptation of multimodal foundation models. For each topic, we formally define the problem and give a thorough review of existing methods. Additionally, we analyze relevant datasets and applications, highlighting open challenges and potential future research directions.

Similar Papers
  • Research Article
  • Cited by 44
  • 10.1007/s10462-024-10915-y
Few-shot adaptation of multi-modal foundation models: a survey
  • Aug 27, 2024
  • Artificial Intelligence Review
  • Fan Liu + 6 more

Multi-modal (vision-language) models, such as CLIP, are replacing traditional supervised pre-training models (e.g., ImageNet-based pre-training) as the new generation of visual foundation models. These models, with robust and aligned semantic representations learned from billions of internet image-text pairs, can be applied to various downstream tasks in a zero-shot manner. However, in some fine-grained domains like medical imaging and remote sensing, the performance of multi-modal foundation models often leaves much to be desired. Consequently, many researchers have begun to explore few-shot adaptation methods for these models, gradually deriving three main technical approaches: (1) prompt-based methods, (2) adapter-based methods, and (3) external knowledge-based methods. Nevertheless, this rapidly developing field has produced numerous results without a comprehensive survey to systematically organize the research progress. Therefore, in this survey, we introduce and analyze the research advancements in few-shot adaptation methods for multi-modal models, summarizing commonly used datasets and experimental setups, and comparing the results of different methods. In addition, due to the lack of reliable theoretical support for existing methods, we derive the few-shot adaptation generalization error bound for multi-modal models. The theorem reveals that the generalization error of multi-modal foundation models is constrained by three factors: domain gap, model capacity, and sample size. Based on this, we propose three possible solutions from the following aspects: (1) adaptive domain generalization, (2) adaptive model selection, and (3) adaptive knowledge utilization.

  • Research Article
  • Cited by 84
  • 10.1109/tiv.2020.3039456
Night-to-Day: Online Image-to-Image Translation for Object Detection Within Autonomous Driving by Night
  • Nov 25, 2020
  • IEEE Transactions on Intelligent Vehicles
  • Mark Schutera + 4 more

Object detectors are central to autonomous driving and are widely used in driver assistance systems. Object detectors are trained on a finite amount of data within a specific domain, hampering detection performance when applying object detectors to samples from other domains during inference, an effect known as domain gap. Domain gap is a concern for data-driven applications, evoking repetitive retraining of networks when the applications unfold into other domains. With object detectors that have been trained on day images only, a domain gap can be observed in object detection by night. Training object detectors on night images is critical because of the enormous effort required to generate an adequate amount of diversely labeled data, and existing data sets often tend to overfit specific domain characteristics. For the first time, this work proposes adapting domains by online image-to-image translation to expand an object detector's domain of operation. The domain gap is decreased without additional labeling effort and without having to retrain the object detector while unfolding into the target domain. The approach follows the concept of domain adaptation, shifting the target domain samples into the domain known to the object detector (source domain). Firstly, the UNIT network is trained for domain adaptation and subsequently cast into an online domain adaptation module, which narrows down the domain gap. Domain adaptation capabilities are evaluated qualitatively by displaying translated samples and visualizing the domain shift through the 2D tSNE algorithm. We quantitatively benchmark the domain adaptation's influence on a state-of-the-art object detector, and on a retrained object detector, for mean average precision, mean recall, and the resulting F1-score. Our approach achieves an F1-score improvement of 5.27% within object detection by night when applying online domain adaptation. The evaluation is executed on the BDD100K benchmark data set.

  • Dissertation
  • 10.17760/d20581917
Unveiling the power of transfer learning towards efficient artificial intelligence
  • Jan 1, 2023
  • Can Qin

Large-scale models, abundant data, and dense computation are the pivotal pillars of deep neural networks. The present-day deep learning models have made significant strides in various areas such as Computer Vision (CV), Natural Language Processing (NLP), and Audio Signal Processing (ASP). These technological integrations have notably improved industrial automation while providing considerable enhancements to daily life. However, despite these advancements, deep learning still faces severe challenges in evolving into an efficient and accessible system. One of the major concerns is data efficiency due to the labor-intensive and costly process of annotated data. The other concern is model efficiency, impacting deployment costs and users' accessibility. Transfer Learning (TL) is a promising solution to address these challenges. TL harnesses the power of acquired data and pre-trained models to facilitate applications of new related tasks or smaller models. This dissertation is structured into three primary sections: Feature Transfer Learning, Model Transfer Learning, and Joint Transfer Learning. (1) Feature Transfer Learning (FTL), widely employed in Domain Adaptation (DA), utilizes a shared encoder model to learn universal representations through cross-domain feature alignment loss. It is primarily comprised of Unsupervised Domain Adaptation (UDA) and Semi-supervised Domain Adaptation (SSDA), depending on target label accessibility. The principal technical challenges with FTL involve distribution mismatch across domains and overfitting toward labeled data. To address these issues, this dissertation proposes structural regularization and multi-level alignment. (2) Model Transfer Learning (MTL) focuses on parameter tuning based on pre-trained models for novel tasks. An exemplary application of MTL is Knowledge Distillation (KD), which facilitates knowledge transfer from larger to smaller models for compression. 
This dissertation introduces a graph-based KD framework that enables real-time graph retrieval. In addition, with the surge of foundation models necessitating efficiency during finetuning, Parameter-Efficient Model Finetuning (PEFT) has received prominence. PEFT has been applied here to enrich a pre-trained tabular model's capacity by injecting external prior knowledge. (3) Joint Transfer Learning (JTL) synergizes FTL and MTL, necessitating both cross-domain feature alignments and parameter tuning. JTL is particularly suitable for instance alignment across different modalities, which helps to build multimodal models without --Author's abstract

  • Research Article
  • 10.1007/978-3-031-94562-5_25
Investigating the Domain Adaptability of General-Purpose Foundation Models for Left Atrium Segmentation from MR Images.
  • Jan 1, 2025
  • Functional imaging and modeling of the heart : ... International Workshop, FIMH ..., proceedings. FIMH (Conference)
  • Bipasha Kundu + 3 more

Segmentation of the left atrium (LA) is crucial for characterizing and appraising left atrial anatomy, morphology, and function in the context of a series of diseases, the most prevalent one being atrial fibrillation (AFib). Despite significant advances in deep learning-based segmentation models, their dependency on large annotated datasets for training limits their effectiveness in niche applications such as atrium segmentation, where annotated data is scarce. Pre-trained foundation models, trained on large-scale general-purpose datasets in a self-supervised manner, can offer an advantage by providing transferable features and enabling adoption to data-scarce domains. In this work, we explore the domain adaptability and robustness of some pre-trained foundation models, such as DINOv2, SAM, and MedSAM, as powerful alternatives for LA segmentation from MRI images. We integrated a modified UNet decoder that leverages the global contextual features encoded by the foundation models. Our approach is evaluated on the 2022 LAScarQS and 2018 LASC segmentation challenge datasets for end-to-end fine-tuning and lower training data settings, respectively. The performance of the UNet decoder was superior to that of the linear decoder used in the original papers of these foundation models, as well as other UNet baselines. Notably, DINOv2 combined with a UNet decoder consistently outperforms the baselines and improves Dice (91.5%, 91.6%) and IoU scores (84.5%, 86.6%), highlighting the model's generalizability and robustness across diverse datasets and limited training data. This study also underscores the transformative potential of foundation models in medical image segmentation, paving the way for more generalized and adaptable solutions across various medical applications.
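
The Dice and IoU scores reported above are standard overlap metrics for segmentation masks. As a rough illustration only (this is not the authors' evaluation code, and the function name `dice_and_iou` and the toy masks are assumptions made for this sketch), they could be computed for a pair of flattened binary masks as follows:

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equally sized flat binary masks (0/1 values)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Count pixels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty; scoring this as perfect agreement is a
        # convention assumed here, not something stated in the abstract.
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Hypothetical toy masks: 3 predicted pixels, 3 true pixels, 2 shared.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
print(f"Dice={dice:.3f}, IoU={iou:.3f}")
```

For a single pair of masks the two metrics are tied by IoU = Dice / (2 - Dice), so Dice is never lower than IoU. Scores averaged over many cases, as in a benchmark table, need not satisfy that identity exactly.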

  • Conference Article
  • Cited by 48
  • 10.24963/ijcai.2020/455
Unsupervised Domain Adaptation with Dual-Scheme Fusion Network for Medical Image Segmentation
  • Jul 1, 2020
  • Danbing Zou + 2 more

Domain adaptation aims to alleviate the problem of retraining a pre-trained model when applying it to a different domain, which requires a large amount of additional training data from the target domain. Such an objective is usually achieved by establishing connections between the source domain labels and target domain data. However, this imbalanced source-to-target one-way pass may not eliminate the domain gap, which limits the performance of the pre-trained model. In this paper, we propose an innovative Dual-Scheme Fusion Network (DSFN) for unsupervised domain adaptation. By building both source-to-target and target-to-source connections, this balanced joint information flow helps reduce the domain gap to further improve the network performance. The mechanism is further applied to the inference stage, where both the original input target image and the generated source images are segmented with the proposed joint network. The results are fused to obtain more robust segmentation. Extensive experiments on unsupervised cross-modality medical image segmentation are conducted on two tasks -- brain tumor segmentation and cardiac structures segmentation. The experimental results show that our method achieved significant performance improvement over other state-of-the-art domain adaptation methods.

  • Research Article
  • Cited by 8
  • 10.2196/52730
Using Domain Adaptation and Inductive Transfer Learning to Improve Patient Outcome Prediction in the Intensive Care Unit: Retrospective Observational Study.
  • Aug 21, 2024
  • Journal of medical Internet research
  • Maruthi Kumar Mutnuri + 3 more

Accurate patient outcome prediction in the intensive care unit (ICU) can potentially lead to more effective and efficient patient care. Deep learning models are capable of learning from data to accurately predict patient outcomes, but they typically require large amounts of data and computational resources. Transfer learning (TL) can help in scenarios where data and computational resources are scarce by leveraging pretrained models. While TL has been widely used in medical imaging and natural language processing, it has been rare in electronic health record (EHR) analysis. Furthermore, domain adaptation (DA) has been the most common TL method in general, whereas inductive transfer learning (ITL) has been rare. To the best of our knowledge, DA and ITL have never been studied in depth in the context of EHR-based ICU patient outcome prediction. This study investigated DA, as well as rarely researched ITL, in EHR-based ICU patient outcome prediction under simulated, varying levels of data scarcity. Two patient cohorts were used in this study: (1) eCritical, a multicenter ICU data set comprising 55,689 unique admission records from 48,672 unique patients admitted to 15 medical-surgical ICUs in Alberta, Canada, between March 2013 and December 2019, and (2) Medical Information Mart for Intensive Care III, a single-center, publicly available ICU data set from Boston, Massachusetts, acquired between 2001 and 2012 containing 61,532 admission records from 46,476 patients. We compared DA and ITL models with baseline models (without TL) of fully connected neural networks, logistic regression, and lasso regression in the prediction of 30-day mortality, acute kidney injury, ICU length of stay, and hospital length of stay. Random subsets of training data, ranging from 1% to 75%, as well as the full data set, were used to compare the performances of DA and ITL with the baseline models at various levels of data scarcity. 
Overall, the ITL models outperformed the baseline models in 55 of 56 comparisons (all P values <.001). The DA models outperformed the baseline models in 45 of 56 comparisons (all P values <.001). ITL resulted in better performance than DA in terms of the number of times and the margin with which it outperformed the baseline models. In 11 of 16 cases (8 of 8 for ITL and 3 of 8 for DA), TL models outperformed baseline models when trained using 1% data subset. TL-based ICU patient outcome prediction models are useful in data-scarce scenarios. The results of this study can be used to estimate ICU outcome prediction performance at different levels of data scarcity, with and without TL. The publicly available pretrained models from this study can serve as building blocks in further research for the development and validation of models in other ICU cohorts and outcomes.

  • Research Article
  • 10.64898/2026.04.23.26351616
Multimodal prediction of visual improvement in diabetic macular edema using real-world electronic health records and optical coherence tomography images
  • Apr 24, 2026
  • medRxiv
  • Siqi Sun + 10 more

Multimodal learning has the potential to improve clinical prediction by integrating complementary data sources, but the incremental value of imaging beyond structured electronic health record (EHR) data remains unclear in real-world settings. We developed a multimodal survival modeling framework integrating optical coherence tomography (OCT) and EHR data to predict time to visual improvement in patients with diabetic macular edema (DME), and evaluated how different ophthalmic foundation model representations contribute to prognostic performance. In a retrospective cohort of 973 patients (1,450 eyes) receiving anti-vascular endothelial growth factor therapy, we compared multimodal models combining 22,227 EHR variables with 196,402 OCT images, with OCT embeddings derived from three ophthalmic foundation models (RETFound, EyeCLIP, and VisionFM). The EHR-only model showed minimal prognostic discrimination (C-index 0.50 [95% CI, 0.45–0.55]). Incorporating OCT improved performance, with the magnitude of improvement depending on the representation. EHR+RETFound achieved the strongest performance (C-index 0.59 [0.54–0.65]), followed by EHR+EyeCLIP (0.57 [0.52–0.62]) and EHR+VisionFM (0.56 [0.51–0.61]). Multimodal models, particularly EHR+RETFound, demonstrated improved risk stratification with clearer separation of Kaplan–Meier curves. Partial information decomposition revealed that prognostic information was dominated by modality-specific contributions, with OCT and EHR providing largely distinct signals and minimal shared information. The magnitude of OCT-specific contribution varied across foundation models and aligned with observed performance differences. These findings indicate that OCT provides complementary prognostic value beyond structured clinical data, but gains are modest and depend strongly on representation choice. 
Our results highlight both the promise of multimodal modeling for personalized prognosis and the need for rigorous, context-specific evaluation of foundation models in real-world clinical settings.
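
The C-index values quoted above measure how well predicted scores order the observed event times, with 0.50 corresponding to chance-level ordering. The following is a minimal sketch of Harrell's concordance index under stated assumptions: the function name `concordance_index`, the toy data, and the convention that a higher score means an earlier expected event are all illustrative choices, and the study's own estimator and confidence-interval procedure are not described in the abstract.

```python
def concordance_index(times, events, scores):
    """Harrell's C-index for right-censored data.

    times  : observed follow-up time per subject
    events : 1 if the event was observed, 0 if the subject was censored
    scores : predicted score, higher = event expected sooner (assumed convention)
    """
    n = len(times)
    concordant = 0.0
    comparable = 0
    for i in range(n):
        if not events[i]:
            # A censored subject cannot serve as the earlier member of a pair.
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0   # correctly ordered pair
                elif scores[i] == scores[j]:
                    concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Hypothetical toy cohort: four subjects, the third one censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
print(concordance_index(times, events, [0.9, 0.7, 0.5, 0.1]))  # perfectly ordered scores
print(concordance_index(times, events, [0.5, 0.5, 0.5, 0.5]))  # uninformative scores
```

A model that assigns every subject the same score lands at 0.5, which is why an EHR-only C-index of 0.50 is read as minimal discrimination.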

  • Book Chapter
  • 10.5772/intechopen.1011584
Domain Adaptation in Multimodal Models
  • Nov 6, 2025
  • Raghavendran Ramakrishnan

Multimodal AI systems can represent and relate information across multiple data modalities. This enables a wide range of new applications when they are deployed in the real world. They also face major challenges, particularly due to domain shift arising from differences between the data they are trained on and the environment they encounter. This chapter focuses on domain adaptation for these systems, which involves helping models adjust to new conditions across different input modalities, such as text, images, or sensor data. Unlike single-modality systems, multimodal models must deal with unique challenges, including mismatches in data types, differences in data quality, and conflicts during training across multiple modalities. We introduce these challenges for multimodal systems. We also provide theoretical foundations using risk minimization and measures of divergence under domain shift across different modalities. From there, we explore practical approaches such as adversarial training to align features, contrastive learning, and flexible fusion techniques that adjust when some inputs are unreliable. We also explore advanced techniques such as deployment-stage adaptation and fine-tuning large foundation models without complete retraining. We provide a summary of different datasets developed for multimodal domain adaptation. The chapter ends by summarizing the key insights discussed and highlighting emerging opportunities in the domain.

  • Research Article
  • 10.1007/s13534-025-00535-y
Multimodal vision-language models in chest x-ray analysis: a study of generalization, supervision, and robustness.
  • Nov 25, 2025
  • Biomedical engineering letters
  • Batoul Aljaddouh + 2 more

Multimodal vision-language models (VLMs) are increasingly applied to medical imaging, yet systematic evaluations comparing them with unimodal models across datasets, supervision regimes, and clinical domains remain scarce. Prior studies often focus on a single dataset, specific pathologies, or one supervision setting, leaving unclear how these models generalize under realistic variability. We conduct a systematic evaluation of six leading unimodal and multimodal models for chest X-ray (CXR) classification using four widely adopted datasets: MIMIC-CXR, CheXpert, NIH-14, and PadChest. We assess model behavior in both zero-shot (ZS) and fine-tuned (FT) configurations, with a focus on generalization across pathologies, datasets, and linguistic domains. Our findings show that pretrained multimodal models such as CheXzero and CXR-LLaVA perform strongly in zero-shot scenarios, especially on out-of-distribution data, reflecting their capacity for semantic generalization. However, their performance tends to decline after fine-tuning in cross-lingual or noisy-label contexts, indicating susceptibility to overfitting. In contrast, unimodal models gain substantially from supervised fine-tuning, especially on in-domain data. Limitations include evaluation on seven shared pathologies, CXR imaging only, and use of publicly available pretrained models, which may restrict generalization to other clinical tasks. These findings highlight key trade-offs between generalization, robustness, and adaptability, and suggest promise in hybrid training strategies that integrate multimodal priors with targeted domain supervision.

  • Research Article
  • Cited by 20
  • 10.1097/cm9.0000000000003489
Artificial intelligence in medical imaging: From task-specific models to large-scale foundation models.
  • Feb 26, 2025
  • Chinese medical journal
  • Yueyan Bian + 4 more

Artificial intelligence (AI), particularly deep learning, has demonstrated remarkable performance in medical imaging across a variety of modalities, including X-ray, computed tomography (CT), magnetic resonance imaging (MRI), ultrasound, positron emission tomography (PET), and pathological imaging. However, most existing state-of-the-art AI techniques are task-specific and focus on a limited range of imaging modalities. Compared to these task-specific models, emerging foundation models represent a significant milestone in AI development. These models can learn generalized representations of medical images and apply them to downstream tasks through zero-shot or few-shot fine-tuning. Foundation models have the potential to address the comprehensive and multifactorial challenges encountered in clinical practice. This article reviews the clinical applications of both task-specific and foundation models, highlighting their differences, complementarities, and clinical relevance. We also examine their future research directions and potential challenges. Unlike the replacement relationship seen between deep learning and traditional machine learning, task-specific and foundation models are complementary, despite inherent differences. While foundation models primarily focus on segmentation and classification, task-specific models are integrated into nearly all medical image analyses. However, with further advancements, foundation models could be applied to other clinical scenarios. In conclusion, all indications suggest that task-specific and foundation models, especially the latter, have the potential to drive breakthroughs in medical imaging, from image processing to clinical workflows.

  • Research Article
  • Cited by 62
  • 10.1016/j.simpat.2023.102754
Artificial intelligence foundation and pre-trained models: Fundamentals, applications, opportunities, and social impacts
  • Mar 22, 2023
  • Simulation Modelling Practice and Theory
  • Adam Kolides + 8 more

  • Research Article
  • 10.1049/cvi2.70009
Foundation Model Based Camouflaged Object Detection
  • Jan 1, 2025
  • IET Computer Vision
  • Zefeng Chen + 3 more

Camouflaged object detection (COD) aims to identify and segment objects that closely resemble and are seamlessly integrated into their surrounding environments, making it a challenging task in computer vision. COD is constrained by the limited availability of training data and annotated samples, and most carefully designed COD models exhibit diminished performance under low‐data conditions. In recent years, there has been increasing interest in leveraging foundation models, which have demonstrated robust general capabilities and superior generalisation performance, to address COD challenges. This work proposes a knowledge‐guided domain adaptation (KGDA) approach to tackle the data scarcity problem in COD. The method utilises the knowledge descriptions generated by multimodal large language models (MLLMs) for camouflaged images, aiming to enhance the model's comprehension of semantic objects and camouflaged scenes through highly abstract and generalised knowledge representations. To resolve ambiguities and errors in the generated text descriptions, a multi‐level knowledge aggregation (MLKG) module is devised. This module consolidates consistent semantic knowledge and forms multi‐level semantic knowledge features. To incorporate semantic knowledge into the visual foundation model, the authors introduce a knowledge‐guided semantic enhancement adaptor (KSEA) that integrates the semantic knowledge of camouflaged objects while preserving the original knowledge of the foundation model. Extensive experiments demonstrate that our method surpasses 19 state‐of‐the‐art approaches and exhibits strong generalisation capabilities even with limited annotated data.

  • Research Article
  • Cited by 2
  • 10.1158/1538-7445.am2024-4905
Abstract 4905: Multimodal transformer model improves survival prediction in lung cancer compared to unimodal approaches
  • Mar 22, 2024
  • Cancer Research
  • Aakash Tripathi + 3 more

Integrating multimodal lung data including clinical notes, medical images, and molecular data is critical for predictive modeling tasks like survival prediction, yet effectively aligning these disparate data types remains challenging. We present a novel method to integrate heterogeneous lung modalities by first thoroughly analyzing various domain-specific models and selecting the optimal model for embedding feature extraction per data type based on performance on representative pretrained tasks. For clinical notes, the GatorTron models showed the lowest regression loss on an initial evaluation set, with the large GatorTron-medium model achieving 12.9 loss. After selecting the top performers, we extracted robust embeddings on the full lung dataset built using the Multimodal Integration of Oncology Data System (MINDS) framework. MINDS provides an end-to-end platform for aggregating and normalizing multimodal patient data. We aligned the multimodal embeddings to a central pre-trained language model using contrastive representation learning based on a cosine similarity loss function. To adapt the language model to the new modalities, we employed a parameter-efficient tuning method called adapter tuning, which introduces small trainable adapter layers that leave the base model weights frozen. This avoids catastrophic forgetting of the pretrained weights. We evaluated our multimodal model on prognostic prediction tasks including survival regression and subtype classification using both public and internal lung cancer datasets spanning multiple histologic subtypes and stages. Our aligned multimodal model demonstrated improved performance over models utilizing only single modalities, highlighting the benefits of integrating complementary information across diverse lung data types. This work illustrates the potential of flexible multimodal modeling for critical lung cancer prediction problems using heterogeneous real-world patient data. 
Our model provides a strong foundation for incorporating emerging data types, modalities, and predictive tasks in the future. Citation Format: Aakash Tripathi, Asim Waqas, Yasin Yilmaz, Ghulam Rasool. Multimodal transformer model improves survival prediction in lung cancer compared to unimodal approaches [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 4905.
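
The abstract above describes aligning per-modality embeddings to a central language model with a cosine similarity loss. The sketch below shows one simple form such a loss could take; it is a simplified illustration rather than the authors' objective. In particular, it pulls matched pairs together and uses no negative pairs, and the names `cosine_similarity` and `cosine_alignment_loss` and the toy vectors are assumptions made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_alignment_loss(modality_embs, anchor_embs):
    """Mean of (1 - cosine similarity) over matched embedding pairs.

    modality_embs : embeddings from one modality (e.g. images), one per patient
    anchor_embs   : embeddings of the same patients in the central model's space
    The loss is 0 when every pair points in the same direction and 2 when
    every pair points in opposite directions.
    """
    if len(modality_embs) != len(anchor_embs) or not modality_embs:
        raise ValueError("need the same non-zero number of embeddings on both sides")
    losses = [1.0 - cosine_similarity(m, a)
              for m, a in zip(modality_embs, anchor_embs)]
    return sum(losses) / len(losses)


# Hypothetical 2-D embeddings for two patients.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[2.0, 0.0], [1.0, 0.0]]
print(cosine_alignment_loss(image_embs, text_embs))  # first pair aligned, second orthogonal
```

In an adapter-tuning setup like the one described, a loss of this kind would be minimized with respect to the small adapter parameters only, while the base model weights stay frozen.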

  • Research Article
  • Cited by 122
  • 10.1016/j.media.2022.102457
Source free domain adaptation for medical image segmentation with fourier style mining.
  • Jul 1, 2022
  • Medical Image Analysis
  • Chen Yang + 3 more

  • Dissertation
  • 10.63028/10067/2122260151162165141
Domain adaptation for applications in computer vision with limited data
  • Jan 1, 2024
  • Mattias Billast

This dissertation explores the challenges and solutions of using domain adaptation in real-time computer vision applications with limited labeled data. Computer vision, initially based on traditional feature extraction methods, has progressed significantly with deep learning, achieving breakthroughs in areas like image classification and object detection. However, deep learning models often require large labeled datasets, which can be expensive and time-consuming to obtain, especially for custom applications. Domain adaptation offers a way to tackle this problem by using external data sources to improve model performance in a target domain. Its effectiveness depends on the domain gap—if the gap between source and target data is too large, adaptation becomes difficult. The dissertation focuses on two applications: maritime autonomous navigation and human motion prediction. In maritime navigation, the goal is to detect, track, and locate obstacles for autonomous vessels, but a lack of labeled data poses a challenge. By using domain adaptation techniques, data from external sources (such as public object detection datasets) is leveraged to improve model accuracy. The second application involves predicting the physical and cognitive ergonomics of operators performing repetitive tasks. This is done by analyzing human pose data and anticipating movements to prevent musculoskeletal issues. Data from a VR setup helps train the model, with domain adaptation used to improve its performance despite limited labeled data. Both applications require real-time performance with lightweight models. Domain adaptation techniques are used to enhance the models by incorporating external data, like maritime object detection datasets or VR controller data for human pose prediction. 
Overall, the thesis highlights the importance of domain adaptation in improving model accuracy with limited data, showing that external data sources can significantly enhance real-time computer vision applications, both in real-world and academic settings. The key contribution is that domain adaptation can utilize any useful external data to improve performance.
