Articles published on multimodal-transformer
- Research Article
- 10.1158/1538-7445.am2025-3652
- Apr 21, 2025
- Cancer Research
- Yubin Xie + 23 more
Abstract Background: Understanding cellular behavior within the context of the tumor-immune microenvironment is essential to developing next-generation cancer therapies and advancing precision medicine. The inherent challenges of this problem reflect the complexity of human biology - patient and tissue heterogeneity, a multitude of interacting signaling pathways, dynamic short- and long-range interactions between the tumor and the immune system, and the intrinsic limitations of measurement techniques. Machine Learning foundation models trained on multimodal patient data present an opportunity to grapple with this complexity and push the field forward leveraging recent advances in spatial biology. Method: A custom multimodal transformer was trained on 1399 primary resections from lung cancer patients profiled via H&E, CosMx spatial transcriptomics, whole-exome sequencing, and a custom multiplex immunofluorescence panel. To our knowledge this is the largest extant spatial transcriptomics dataset, comprising more than 40 million cells. The transformer was trained via self-supervised learning to predict expression of each CosMx panel gene for a single cell, conditioned on both spatially proximal and patient-level data from all modalities. This training task induces the model to learn fundamental rules that govern cell state and cell-cell interactions within the context of disease. The resulting model, Celleporter, can generate gene expression in a “virtual” cell at a particular location within real or simulated patient tissue. Counterfactual simulations with modified patient data can be used to predict the effects of genetic alterations, gene expression changes, or external interventions on the tumor-immune microenvironment. Result: Celleporter accurately predicted spatial gene expression patterns from sparsely sampled data, resolving the limitations of traditional experimental approaches. 
Virtual cell simulations reproduced distinct biological states, such as cytotoxic and naïve transitions of CD8+ T cells within and outside tumor regions and reproduced foundational immunology, including the relationship between MHC-I and T cell activation. Comparative analyses across patient cohorts identified immune-suppressive mechanisms in STK11-mutant tumors resistant to immunotherapy. Perturbation simulations highlighted therapeutic targets predicted to restore cytotoxic activity in STK11-mutant tumor microenvironments. Conclusion: This study demonstrates that a self-supervised foundation model trained on large-scale multimodal patient data can learn fundamental aspects of cancer immunology, and accurately reproduce the impact of the tumor-immune microenvironment on cell state in a patient-specific manner. This flexible system for interrogating cell and tissue biology has direct application to patient stratification and target discovery. Citation Format: Yubin Xie, Eshed Margalit, Tyler Van Hensbergen, Dexter Antonio, Jake Schmidt, Yu Phoebe Guo, Jérémie Decalf, Maxime Dhainaut, Lucas Cavalcante, Hargita Kaplan, Rodney Collins, Francis Fernandez, Rob Schiemann, Eric Siefkas, Michela Meister, Joy Tea, Carl Ebeling, Anastasia Mavropoulos, Nicole Snell, Shafique Virani, Ron Alfa, Lacey Padron, Jacob Rinaldi, Daniel Bear. Celleporter: a foundation model of cell and tissue biology with application to patient stratification and target discovery [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2025; Part 1 (Regular Abstracts); 2025 Apr 25-30; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2025;85(8_Suppl_1):Abstract nr 3652.
- Research Article
- 10.3390/ai6040075
- Apr 11, 2025
- AI
- Renas Mukhametzianov + 1 more
The rise of large-scale language models and multimodal transformers has enabled instruction-based policies, such as vision-and-language navigation. To leverage their general world knowledge, we propose multimodal annotations for action options and support selection from a dynamic, describable action space. Our framework employs a multimodal transformer that processes front-facing camera images, light detection and ranging (LIDAR) sensor’s point clouds, and tasks as textual instructions to produce a history-aware decision policy for mobile robot navigation. Our approach leverages a pretrained vision–language encoder and integrates it with a custom causal generative pretrained transformer (GPT) decoder to predict action sequences within a state–action history. We propose a trainable attention score mechanism to efficiently select the most suitable action from a variable set of possible options. Action options are text–image pairs and encoded using the same multimodal encoder employed for environment states. This approach of annotating and dynamically selecting actions is applicable to broader multidomain decision-making tasks. We compared two baseline models, ViLT (vision-and-language transformer) and FLAVA (foundational language and vision alignment), and found that FLAVA achieves superior performance within the constraints of 8 GB video memory usage in the training phase. Experiments were conducted in both simulated and real-world environments using our custom datasets for instructed task completion episodes, demonstrating strong prediction accuracy. These results highlight the potential of multimodal, dynamic action spaces for instruction-based robot navigation and beyond.
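The attention-score selection over a variable set of action options can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the state and each option have already been encoded into fixed-length vectors, and it uses a plain scaled dot product where the paper describes a trainable scoring mechanism. The name `select_action` and the toy embeddings are hypothetical.

```python
import math

def select_action(state, options):
    """Score each option embedding against the state and pick the best one.

    Assumption: a scaled dot product stands in for the trainable
    attention score described in the abstract.
    """
    scale = math.sqrt(len(state))
    scores = [sum(s * o for s, o in zip(state, opt)) / scale for opt in options]
    # Numerically stable softmax over however many options are offered.
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Toy embeddings (hypothetical values); the option set may have any length.
state = [1.0, 0.0, 1.0]
options = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
best, probs = select_action(state, options)
```

Because the softmax is taken over the options actually supplied, the same function handles two candidate actions or twenty without any change to its parameters, which is the property a dynamic action space needs.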
- Research Article
2
- 10.1609/aaai.v39i5.32497
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
- Hao Li + 4 more
Social Intelligence Queries (Social-IQ) serve as the primary multimodal benchmark for evaluating a model’s social intelligence level. While impressive multiple-choice question (MCQ) accuracy is achieved by current solutions, increasing evidence shows that they are largely, and in some cases entirely, dependent on language modality, overlooking visual context. Additionally, the closed-set nature further prevents the exploration of whether and to what extent the reasoning path behind selection is correct. To address these limitations, we propose the Visually Explainable and Grounded Artificial Social Intelligence (VEGAS) model. As a generative multimodal model, VEGAS leverages open-ended answering to provide explainable responses, which enhances the clarity and evaluation of reasoning paths. To enable visually grounded answering, we propose a novel sampling strategy to provide the model with more relevant visual frames. We then enhance the model’s interpretation of these frames through Generalist Instruction Fine-Tuning (GIFT), which aims to: i) learn multimodal language transformations for fundamental emotional social traits, and ii) establish multimodal joint reasoning capabilities. Extensive experiments, comprising modality ablation, open-ended assessments, and supervised MCQ evaluations, consistently show that VEGAS effectively utilizes visual information in reasoning to produce correct and also credible answers. We expect this work to offer a new perspective on Social-IQ and advance the development of human-like social AI.
- Research Article
3
- 10.1609/aaai.v39i2.32136
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
- Ryo Masumura + 7 more
This paper presents a novel method for automatically recognizing people's apparent personality traits as perceived by others. In previous studies, apparent personality trait recognition from multimodal human behavior is often modeled to directly estimate personality trait scores, i.e., the “Big Five” scores. In the model training phase, ground-truth personality trait scores were often determined from personality test results scored by many other people using fine-grained questionnaires. However, the rich information in the personality test results has not been leveraged for anything other than determining the ground-truth Big Five scores. The scores assigned to each questionnaire item are thought to include more meta-level differences in personality characteristics. Therefore, we propose joint modeling methods that can estimate not only the Big Five scores but also questionnaire item-level scores. This enables us to improve awareness of multimodal human behavior. In addition, we present a newly created self-introduction video dataset with 50-item Big Five questionnaire results, since previous apparent personality trait recognition datasets do not provide such personality test results. Experiments using the created dataset demonstrate that our proposed joint modeling methods with a multimodal transformer backbone can improve the estimation of Big Five scores and effectively estimate questionnaire item-level scores. We also verify that the estimation performance reached human evaluation performance.
- Research Article
3
- 10.1364/oe.555722
- Apr 7, 2025
- Optics Express
- Linwei Shang + 7 more
Raman spectroscopy has been shown to have the potential to accurately diagnose a variety of diseases, and what we believe to be novel Raman probes and instruments for clinical applications are constantly being developed. However, biological tissues are usually structurally complex, so the Raman signals collected in vivo may come from a variety of chemical components, or even from different tissues. This work proposes a Raman spectral unmixing approach that can separate the signals of different tissues from their mixed spectra. Specifically, multimodal frequency and time-frequency transformations were performed together to extract the different features of mixed spectra. An attention U-net model was introduced to predict the spectra of target tissues in each modality. Multimodal fusion was then conducted to filter and integrate effective information from these modalities and obtain accurate unmixed spectra. Canine knee joints with osteoarthritis were selected as the research subject, and the spectra of subchondral bone and cartilage were successfully separated from their mixed spectra; the separated spectra can be used in osteoarthritis research just like directly measured spectra. This work will contribute to in vivo biological detection with Raman probes and instruments, enabling them to separate signals from different tissues, structures, and even biochemical molecular components, and thereby achieve more accurate prediction and diagnosis.
- Research Article
11
- 10.1038/s41467-025-58499-7
- Apr 4, 2025
- Nature Communications
- Junwu Chen + 4 more
The fast assessment of the global minimum adsorption energy (GMAE) between catalyst surfaces and adsorbates is crucial for large-scale catalyst screening. However, multiple adsorption sites and numerous possible adsorption configurations for each surface/adsorbate combination make it prohibitively expensive to calculate the GMAE through density functional theory (DFT). Thus, we designed a multi-modal transformer called AdsMT to rapidly predict the GMAE based on surface graphs and adsorbate feature vectors without site-binding information. The AdsMT model effectively captures the intricate relationships between adsorbates and surface atoms through the cross-attention mechanism, hence avoiding the enumeration of adsorption configurations. Three diverse benchmark datasets were introduced, providing a foundation for further research on the challenging GMAE prediction task. Our AdsMT framework demonstrates excellent performance by adopting the tailored graph encoder and transfer learning, achieving mean absolute errors of 0.09, 0.14, and 0.39 eV, respectively. Beyond GMAE prediction, AdsMT’s cross-attention scores showcase the interpretable potential to identify the most energetically favorable adsorption sites. Additionally, uncertainty quantification was integrated into our models to enhance the trustworthiness of the predictions.
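A single-head cross-attention step of the kind the abstract mentions can be sketched as below: one adsorbate feature vector acts as the query, and per-atom surface embeddings act as keys and values. This is an illustrative sketch only, not the AdsMT architecture; it omits learned projection matrices, and the name `cross_attention` and the toy numbers are hypothetical.

```python
import math

def cross_attention(query, keys, values):
    """Attend from one query vector over a set of key/value vectors.

    Returns the attention weights (one per key) and the weighted
    sum of the value vectors.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, pooled

# Hypothetical adsorbate query and three surface-atom embeddings.
adsorbate = [1.0, 0.5]
surface_atoms = [[0.2, 0.1], [1.5, 0.8], [0.0, 0.3]]
weights, pooled = cross_attention(adsorbate, surface_atoms, surface_atoms)
most_attended_atom = max(range(len(weights)), key=weights.__getitem__)
```

Reading off the index of the largest weight is one simple way to inspect which surface atom the query attends to most, in the spirit of the interpretability claim; whether that index corresponds to a favorable adsorption site is something only the trained model and its validation can establish.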
- Research Article
1
- 10.1109/jbhi.2024.3496700
- Apr 1, 2025
- IEEE Journal of Biomedical and Health Informatics
- Baiying Lei + 11 more
A sellar region tumor is a brain tumor that occurs only in the sellar region of the brain and affects the central nervous system. Early diagnosis of sellar region tumor subtypes helps clinicians better understand the best treatment and recovery options for patients. Magnetic resonance imaging (MRI) has proven to be an effective tool for the early detection of sellar region tumors. However, sellar region tumor diagnosis remains challenging because datasets are small and imbalanced. To overcome these challenges, we propose a novel self-supervised multi-scale multi-modal graph pool Transformer (MMGPT) network that can enhance the multi-modal fusion of small and imbalanced MRI data of sellar region tumors. MMGPT can strengthen feature interaction between multi-modal images, which makes our model more robust. A contrastive-learning-equipped auto-encoder (CAE) trained via self-supervised learning (SSL) is adopted to learn more detailed information between different samples. The proposed CAE transfers the pre-trained knowledge to the downstream tasks. Finally, a hybrid loss is used to relieve the performance degradation caused by data imbalance. The experimental results show that the proposed method outperforms state-of-the-art methods and obtains higher accuracy and AUC in the classification of sellar region tumors.
- Research Article
- 10.1227/neu.0000000000003360_185
- Apr 1, 2025
- Neurosurgery
- Zhuoyuan Li + 3 more
INTRODUCTION: Missing MRI sequences due to limited scan time, image artifacts, scan corruption, allergies to contrast agents, etc. in real clinical environments limit the availability of deep learning (DL) brain tumor segmentation models. Currently, there is no systematic evaluation of the impact of missing sequences on DL-based brain tumor segmentation models, nor is there a unified approach to address these diverse scenarios. METHODS: We proposed a pipeline for brain tumor segmentation in various MRI sequence missing scenarios. A novel unpaired multi-modal generative adversarial transformer (UMMGAT) was designed to perform image-to-image translation among MRI sequences from the BRATS dataset (335 cases) and local dataset (92 cases). The generated images were then used to replace missing ones as inputs for a multi-modal segmentation network. RESULTS: The UMMGAT can be effectively trained with unpaired data to perform arbitrary image-to-image translations among multi-center multi-sequence MR images. The median DSCs of the brain tumor segmentation are significantly improved by using generated images vs copied images under most of the scenarios. Notably, in scenarios where Flair, T1, T2, T1ce, Flair&T1ce, and T1&T1ce sequences were absent, using gFlair_from_T2 (generated Flair from T2), gT1_from_T2, gT2_from_T1, gT1ce_from_T1, gFlair_from_T2 & gT1ce_from_T1, and gT1_from_T2 & gT1ce_from_T2 vs using copied source sequences for substitution, the median DSCs of WT were 0.734 (0.591-0.832) vs 0.554 (0.341-0.689), 0.905 (0.774-0.927) vs 0.854 (0.704-0.911), 0.865 (0.742-0.908) vs 0.781 (0.515-0.888), 0.894 (0.794-0.917) vs 0.849 (0.687-0.914), 0.71 (0.536-0.81) vs 0.436 (0.144-0.615), and 0.897 (0.756-0.919) vs 0.721 (0.512-0.858) (p<0.0001). CONCLUSIONS: The proposed UMMGAT can synthesize high-fidelity images through training on unpaired datasets. 
The generated images exhibit potential to enhance the DL brain tumor segmentation model in simulated sequence missing scenarios.
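The Dice similarity coefficient (DSC) reported above is defined as twice the overlap between predicted and reference masks divided by the sum of their sizes. The sketch below computes it for flattened binary masks and then takes a median across cases, as the abstract does; the function name and the toy masks are hypothetical, and returning 1.0 when both masks are empty is a convention chosen here, not something stated in the source.

```python
from statistics import median

def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two flattened binary masks."""
    pred_voxels = {i for i, v in enumerate(pred) if v}
    truth_voxels = {i for i, v in enumerate(truth) if v}
    if not pred_voxels and not truth_voxels:
        return 1.0  # assumed convention: two empty masks agree perfectly
    overlap = len(pred_voxels & truth_voxels)
    return 2.0 * overlap / (len(pred_voxels) + len(truth_voxels))

# Toy per-case masks (hypothetical), one (prediction, reference) pair per case.
cases = [
    ([1, 1, 0, 0], [1, 0, 1, 0]),
    ([1, 1, 1, 0], [1, 1, 1, 0]),
    ([0, 0, 1, 1], [0, 1, 1, 1]),
]
per_case_dsc = [dice_coefficient(p, t) for p, t in cases]
median_dsc = median(per_case_dsc)
```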
- Research Article
12
- 10.1016/j.compbiomed.2025.109721
- Apr 1, 2025
- Computers in Biology and Medicine
- Belinda Lokaj + 9 more
Breast cancer is the most common cancer worldwide, and magnetic resonance imaging (MRI) constitutes a very sensitive technique for invasive cancer detection. When reviewing breast MRI examinations, clinical radiologists rely on multimodal information, composed of imaging data but also information not present in the images, such as clinical information. Most machine learning (ML) approaches are not well suited for multimodal data. However, attention-based architectures, such as Transformers, are flexible and therefore good candidates for integrating multimodal data. The aim of this study was to develop and evaluate a novel multimodal deep learning (DL) model combining ultrafast dynamic contrast-enhanced (UF-DCE) MRI images, lesion characteristics and clinical information for breast lesion classification. From 2019 to 2023, UF-DCE breast images and radiology reports of 240 patients were retrospectively collected from a single clinical center and annotated. Imaging data consisted of volumes of interest (VOI) extracted around segmented lesions. Non-imaging data consisted of both clinical (categorical) and geometrical (scalar) data. Clinical data were extracted from annotated reports and were associated with their corresponding lesions. We compared the diagnostic performances of traditional ML methods for non-imaging data, an image model based on the DL architecture, and a novel Transformer-based architecture, the Multimodal Sieve Transformer with Vision Transformer encoder (MMST-V). The final dataset included 987 lesions (280 benign, 121 malignant lesions, and 586 benign lymph nodes) and 1081 reports. For classification with non-imaging data, scalar data had a greater influence on performances of lesion classification (Area under the receiver operating characteristic curve (AUROC)=0.875±0.042) than categorical data (AUROC=0.680±0.060). 
MMST-V achieved better performances (AUROC=0.928±0.027) than classification based on non-imaging data (AUROC=0.900±0.045), and imaging data only (AUROC=0.863±0.025). The proposed MMST-V is an adaptive approach that can consider redundant information provided by multimodal information. It demonstrated better performances than unimodal methods. Results highlight that the combination of clinical patient data and detailed lesion information as additional clinical knowledge enhances the diagnostic performances of UF-DCE breast MRI.
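The AUROC values quoted above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counting one half. The sketch below computes a binary AUROC directly from that definition. It is a generic illustration: the abstract's task involves three lesion classes, and how the authors reduced it to AUROC is not stated, so the binary setup, the function name, and the toy scores are all assumptions.

```python
def auroc(scores, labels):
    """Binary AUROC via pairwise comparison of positive and negative scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Toy scores (hypothetical): label 1 = malignant, label 0 = benign.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
value = auroc(scores, labels)
```

The pairwise loop is quadratic in the number of cases, which is fine for a few hundred lesions; a rank-based formulation gives the same number more efficiently on large datasets.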
- Research Article
10
- 10.1007/s11229-025-04961-4
- Mar 27, 2025
- Synthese
- Xabier E Barandiaran + 1 more
This paper introduces the concept of “generative midtended cognition”, which explores the integration of generative AI technologies with human cognitive processes. The term “generative” reflects AI’s ability to iteratively produce structured outputs, while “midtended” captures the potential hybrid (human-AI) nature of the process. It stands between traditional conceptions of intended creation, understood as steered or directed from within, and extended processes that bring exo-biological processes into the creative process. We examine the working of current generative technologies (based on multimodal transformer architectures typical of large language models like ChatGPT) to explain how they can transform human cognitive agency beyond what the conceptual resources of standard theories of extended cognition can capture. We suggest that the type of cognitive activity typical of the coupling between a human and generative technologies is closer (but not equivalent) to social cognition than to classical extended cognitive paradigms. Yet, it deserves a specific treatment. We provide an explicit definition of generative midtended cognition in which we treat interventions by AI systems as constitutive of the agent’s intentional creative processes. Furthermore, we distinguish two dimensions of generative hybrid creativity: (1) width, which captures the sensitivity of the context of the generative process (from the single letter to the whole historical and surrounding data), and (2) depth, which captures the granularity of iteration loops involved in the process. Generative midtended cognition stands in the middle depth between conversational forms of cognition in which complete utterances or creative units are exchanged, and micro-cognitive (e.g. neural) subpersonal processes. Finally, the paper discusses the potential risks and benefits of widespread generative AI adoption, including the challenges of authenticity, generative power asymmetry, and creative boost or atrophy.
- Research Article
- 10.52783/cana.v32.4547
- Mar 26, 2025
- Communications on Applied Nonlinear Analysis
- Janardhan Komarolu, C. Nagaraju
As more people rely on biometrics in conjunction with traditional identity authentication systems, attacks like deepfakes and adversarial techniques have emerged as prominent threats in several identity verification systems. Traditional unimodal and even static multimodal schemes are generally found ineffective in the face of new attacks because they are vulnerable to different adversarial manipulations, cannot verify continuously during the use phase, and lack adaptability to change. In light of these limitations, we present an adaptive, robust, and privacy-preserving biometric security framework based on a combination of various transformer-based models with ensemble learning strategies. Our Adaptive Multi-Modal Feature Fusion Transformer (AMFFT) dynamically integrates attention-based facial, voice, and fingerprint features to optimize security through context-oriented fusion. To ensure enhanced privacy and robustness, we also integrate Differentially Private Adversarial Training (DPAT), which aims to lessen the effect of model inversion and adversarial spoofing. Our Spoof-Resistant Multimodal Attention Transformer (SMA-Transformer) detects deepfake and synthetic attacks by checking consistency between modalities, ensuring that the biometric signals agree. In addition, the Ensemble Learning with Zero-Trust Verification Model (EZV-Model) is responsible for continuous authentication through real-time analysis of biometric scores and behavioral traits. Finally, the Real-Time Behavioral Biometric Security Model (RBB-Sec) can detect advanced impersonation scenarios based on micro-expressions, keystroke dynamics, and voice stress patterns. 
In combination with the above techniques, the proposed framework guarantees a significant improvement in performance regarding authentication accuracy (≥99.5%), spoof detection (≥99.3%), and adversarial robustness (≤1.2% evasion rate), while maintaining low false rejection rates (≤1.5%). By integrating adaptive biometric fusion, deepfake-resistant verification, and zero-trust-based continuous authentication, this work lays an advanced security paradigm against emerging cyber threats for biometric security systems.
- Preprint Article
- 10.21203/rs.3.rs-6297243/v1
- Mar 26, 2025
- Research Square
- Filip Dahlén + 7 more
Abstract Cutaneous melanoma is an aggressive form of skin cancer. Knowing whether a primary melanoma is likely to metastasize is crucial for treatment and survival prediction of melanoma patients. We aimed to develop a predictive tool for determining metastatic potential in primary melanomas utilizing a weakly supervised vision language model. A total of 426 routinely stained whole slide images (WSI), along with corresponding histopathological features (Breslow thickness, diameter, presence of dermal mitoses, ulceration and regression), were collected. Of these, 341 samples were used for training and validation, while 85 were reserved as a holdout test set. WSIs were split into patches, and feature embeddings were extracted using Prov-GigaPath. Histopathological features were converted to text, with embeddings generated by BiomedBERT. We developed a multimodal transformer integrating WSIs and histopathological features and conducted an ablation study comparing it to (1) TransMIL using only WSIs and (2) an MLP using only histopathological features. Each model employed a bagging ensemble with five cross-validation models. The multimodal transformer achieved an AUC of 0.887, slightly higher than TransMIL (0.883) and notably better than BertMLP (0.800), highlighting the benefit of including imaging and clinical data for early recognition of melanomas with high metastatic potential.
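A bagging ensemble over cross-validation models can be sketched as follows. The abstract does not say how the five fold models are combined, so averaging their predicted probabilities is an assumption made here for illustration; the function name and the toy probabilities are likewise hypothetical.

```python
def bagged_probabilities(per_model_probs):
    """Average predicted probabilities across fold models, per test sample.

    per_model_probs: one list per model, each holding that model's
    predicted probability for every held-out sample, in the same order.
    """
    n_models = len(per_model_probs)
    n_samples = len(per_model_probs[0])
    if any(len(p) != n_samples for p in per_model_probs):
        raise ValueError("every model must score the same samples")
    return [
        sum(model[i] for model in per_model_probs) / n_models
        for i in range(n_samples)
    ]

# Five fold models (hypothetical outputs) scoring three held-out slides.
fold_outputs = [
    [0.80, 0.20, 0.55],
    [0.70, 0.30, 0.45],
    [0.90, 0.10, 0.60],
    [0.75, 0.25, 0.50],
    [0.85, 0.15, 0.40],
]
ensemble = bagged_probabilities(fold_outputs)
```

Averaging tends to smooth out the idiosyncrasies of any single fold model, which matters on a holdout set as small as the 85 slides described above.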
- Research Article
1
- 10.1088/1741-2552/adbec0
- Mar 25, 2025
- Journal of Neural Engineering
- Jingwei Zhang + 6 more
Objective. Tonic-clonic seizures (TCSs), which present a significant risk for sudden unexpected death in epilepsy, require accurate detection to enable effective long-term monitoring. Previous studies have demonstrated the advantages of multimodal seizure detection systems in reliably detecting TCSs over extended periods. However, the effectiveness of these data-driven systems depends heavily on the availability of reliable training data. Approach. To address this need, we propose an innovative data selection method designed to identify high-quality training samples. Our approach evaluates sample quality based on learning difficulty, classifying samples with lower learning difficulty as higher quality. We then introduce a confidence-based method to quantify the proportion of high-quality samples within the dataset. Main results. Experimental results show that our method improves the performance of a state-of-the-art TCS detection model by 11%. Significance. Using this data selection method, we develop a training pipeline that enhances the training process of multimodal seizure detection models.
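Selecting training samples by learning difficulty can be sketched as below. The abstract does not define how difficulty is measured or how the kept proportion is derived, so this sketch assumes a simple proxy (mean training loss per sample across epochs) and takes the proportion as a given input; `select_easy_samples`, `loss_history`, and `keep_fraction` are hypothetical names.

```python
def select_easy_samples(loss_history, keep_fraction):
    """Keep the fraction of samples with the lowest mean training loss.

    loss_history: dict mapping sample id -> list of per-epoch losses.
    Assumption: low mean loss stands in for low learning difficulty.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    difficulty = {sid: sum(losses) / len(losses) for sid, losses in loss_history.items()}
    ranked = sorted(difficulty, key=difficulty.get)  # easiest first
    n_keep = max(1, int(keep_fraction * len(ranked)))
    return ranked[:n_keep]

# Toy per-epoch losses (hypothetical) for four training segments.
loss_history = {
    "seg_a": [0.9, 0.7],
    "seg_b": [0.2, 0.1],
    "seg_c": [0.5, 0.4],
    "seg_d": [1.2, 1.1],
}
kept = select_easy_samples(loss_history, keep_fraction=0.5)
```

In the paper the proportion is estimated with a confidence-based method; here it is simply passed in, so the sketch covers only the ranking half of the procedure.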
- Preprint Article
- 10.20944/preprints202503.1265.v1
- Mar 18, 2025
- Preprints.org
- Noah Brown + 3 more
Creating sophisticated machine learning models to comprehend interactions between individuals can lead to more intuitive user experiences for interactive systems like Amazon Alexa. Beyond basic indicators such as voice modulation and eye movement, a person's combined audio-visual expressions—including vocal intonation and facial gestures—act as subtle cues reflecting the level of engagement in a conversation. This research explores advanced deep learning techniques for the detection of user expressions through audio-visual data. Initially, we develop a foundational audio-visual model incorporating recurrent neural network layers, which demonstrates performance on par with existing leading methods. Subsequently, we introduce a novel transformer-based framework equipped with encoder layers that more effectively fuse audio and visual features for tracking expressions. Evaluation using the Aff-Wild2 dataset reveals that our proposed transformer models outperform the recurrent-based baseline by approximately 2% in accurately identifying arousal and valence metrics. Additionally, our multimodal transformer approaches exhibit notable enhancements compared to unimodal models, achieving performance improvements of up to 3.6%. Comprehensive ablation analyses confirm the crucial role of visual information in the accurate detection of expressions within the Aff-Wild2 dataset. These findings underscore the potential of transformer architectures in advancing the field of expression recognition and enhancing human-computer interaction systems.
- Research Article
2
- 10.3390/app15052862
- Mar 6, 2025
- Applied Sciences
- Dang-Khanh Nguyen + 4 more
Emotion recognition in video aims to estimate human emotions using acoustic, visual, and linguistic information. This problem is considered multimodal and requires learning different modalities, such as visual, verbal, and vocal cues. Although previous studies have focused on developing sophisticated deep learning models, this work proposes a different approach using dynamic restrained adaptive loss inspired by multitask learning to understand multimodal inputs jointly. This training strategy allows predictions from one modality to enhance the accuracy of predictions from other modalities, mirroring the concept of multitask learning, where the results of one task can improve the performance of related tasks. Furthermore, this work introduces the extended multimodal bottleneck transformer, an efficient and effective mid-fusion method designed for problems involving more than two modalities to enhance the performance of emotion recognition systems. The proposed method significantly improves results compared to other end-to-end multimodal fusion techniques on three multimodal benchmarks—Interactive Emotional Dyadic Motion Capture (IEMOCAP), Carnegie Mellon University Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI), and the Chinese Multimodal Sentiment Analysis dataset with independent unimodal annotations (CH-SIMS).
- Research Article
13
- 10.1038/s41746-025-01530-4
- Mar 5, 2025
- npj Digital Medicine
- Yunsu Byeon + 10 more
Molecular subtyping and grading of adult-type diffuse gliomas are essential for treatment decisions and patient prognosis. We introduce GlioMT, an interpretable multimodal transformer that integrates imaging and clinical data to predict the molecular subtype and grade of adult-type diffuse gliomas according to the 2021 WHO classification. GlioMT is trained on multiparametric MRI data from an institutional set of 1053 patients with adult-type diffuse gliomas to predict the IDH mutation status, 1p/19q codeletion status, and tumor grade. External validation on the TCGA (200 patients) and UCSF (477 patients) shows that GlioMT outperforms conventional CNNs and visual transformers, achieving AUCs of 0.915 (TCGA) and 0.981 (UCSF) for IDH mutation, 0.854 (TCGA) and 0.806 (UCSF) for 1p/19q codeletion, and 0.862 (TCGA) and 0.960 (UCSF) for grade prediction. GlioMT enhances the reliability of clinical decision-making by offering interpretability through attention maps and contributions of imaging and clinical data.
- Research Article
- 10.1007/s43674-025-00079-9
- Mar 1, 2025
- Advances in Computational Intelligence
- Siddhanth U Hegde + 7 more
A meme is a piece of media created to share an opinion or emotion across the internet. Due to their popularity, memes have become a new form of communication on social media. However, because of their nature, they are increasingly used in harmful ways such as trolling and cyberbullying. Various data modelling methods create different possibilities in feature extraction and turn them into beneficial information. The variety of modalities included in data plays a significant part in predicting the results. We explore the significance of the visual features of images in classifying memes. Memes are a blend of both image and text, where the text is embedded into the picture. We consider a meme to be trolling if it in any way tries to troll a particular individual, group, or organisation. We classify memes as trolling or non-trolling based on their images and text. We evaluate whether visual features contribute significantly to identifying whether a meme is trolling or not. Our work illustrates different textual analysis methods and contrasting multimodal approaches, ranging from simple merging to cross attention, that utilise both visual and textual features. The fine-tuned cross-lingual language model, XLM, performed the best in textual analysis, and the multimodal transformer performs the best in multimodal analysis.
- Research Article
- 10.3724/sp.j.1089.2023-00248
- Mar 1, 2025
- Journal of Computer-Aided Design & Computer Graphics
- Xinyu Xia + 4 more
Cross-modal retrieval takes one modality data as a query and retrieves semantically relevant data in another modality. Most existing cross-modal retrieval methods are designed for scenarios with complete modality data. However, in real-world applications, incomplete modality data often exists, which these methods struggle to handle effectively. In this paper, we propose a typical concept-driven modality-missing deep cross-modal retrieval model. Specifically, we first propose a multi-modal Transformer integrated with multi-modal pretraining networks, which can fully capture the multi-modal fine-grained semantic interaction in the incomplete modality data, extract multi-modal fusion semantics and construct cross-modal subspace, and at the same time supervise the learning process to generate typical concepts. In addition, the typical concepts are used as the cross-attention key and value to drive the training of the modal mapping network, so that it can adaptively preserve the implicit multi-modal semantic concepts of the query modality data, generate cross-modal retrieval features, and fully preserve the pre-extracted multi-modal fusion semantics. Experimental results on four benchmark cross-modal retrieval datasets—Wikipedia, Pascal-Sentence, NUSWIDE, and XmediaNet—show that our proposed method outperforms the existing baseline models in the paper, with average precision improvements of 1.7%, 5.1%, 1.6%, and 5.4%, respectively. The source code of our method is available at: https://gitee.com/MrSummer123/CPCMR.
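Retrieval quality in this setting is commonly summarized by average precision over each ranked result list, then averaged over queries. The sketch below shows that computation under the assumption that the reported "average precision" is of this standard form; the abstract does not give the exact variant or cutoff, and the function names and toy relevance lists are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average precision for one query, given 0/1 relevance in ranked order."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_rankings):
    """Mean of per-query average precision."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)

# Two toy queries (hypothetical): 1 marks a retrieved item that shares
# the query's semantic category, 0 marks one that does not.
rankings = [[1, 0, 1], [0, 1]]
score = mean_average_precision(rankings)
```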
- Research Article
9
- 10.1016/j.neucom.2025.129376
- Mar 1, 2025
- Neurocomputing
- Abdul Aziz + 4 more
MMTF-DES: A fusion of multimodal transformer models for desire, emotion, and sentiment analysis of social media data
- Research Article
4
- 10.1038/s41598-025-90115-y
- Feb 17, 2025
- Scientific Reports
- Quan Anh Duong + 2 more
Current deep learning methods for diagnosing Alzheimer’s disease (AD) typically rely on analyzing all or parts of high-resolution 3D volumetric features, which demand expensive computational resources and powerful GPUs, particularly when using multimodal data. In contrast, lightweight cortical surface representations offer a more efficient approach for quantifying AD-related changes across different cortical regions, such as alterations in cortical structures, impaired glucose metabolism, and the deposition of pathological biomarkers like amyloid-β and tau. Despite these advantages, few studies have focused on diagnosing AD using multimodal surface-based data. This study pioneers a novel method that leverages multimodal, lightweight cortical surface features extracted from MRI and PET scans, providing an alternative to computationally intensive 3D volumetric features. Our model employs a middle-fusion approach with a cross-attention mechanism to efficiently integrate features from different modalities. Experimental evaluations on the ADNI series dataset, using T1-weighted MRI and Fluorodeoxyglucose PET, demonstrate that the proposed model outperforms volume-based methods in both early AD diagnosis accuracy and computational efficiency. The effectiveness of our model is further validated with the combination of T1-weighted MRI, Aβ PET, and Tau PET scans, yielding favorable results. Our findings highlight the potential of surface-based transformer models as a superior alternative to conventional volume-based approaches.