
Related Topics

  • Spatial Transformation

Articles published on Multimodal Transformer

471 Search results
Sort by
Recency
  • New
  • Research Article
  • Cited by 1
  • 10.1016/j.media.2026.103966
Multimodal sparse fusion transformer network with spatio-temporal decoupling for breast tumor classification.
  • May 1, 2026
  • Medical Image Analysis
  • Jiahao Xu + 5 more


  • New
  • Research Article
  • 10.3390/sym18050723
LEO Satellite Signals Optimized Interference Method with Multimodal Learning Transformer Model
  • Apr 24, 2026
  • Symmetry
  • Chengkai Tang + 4 more

Low-Earth orbit (LEO) satellites are gradually becoming the core infrastructure of integrated aerospace communication networks, owing to their high communication rates, low transmission delay, and wide coverage. Interference with military communications in response to their security and protection needs is a current research challenge. Consequently, this paper introduces an interference technique optimized for LEO satellite signals using a multimodal learning transformer model (OI-MLT). The proposed method incorporates symmetry-aware design by exploiting the inherent time–frequency structural characteristics of LEO satellite signals and the spatially distributed topology of interference sources. An optimized model for distributed interference sources is developed, and multimodal information of spectra and numerical values is processed in parallel through the self-attention mechanism. This approach addresses the problem of dynamic matching between the interference signal and target signal in high-speed LEO scenarios, as well as high-precision interference synchronization under time-varying channels. Experimental results demonstrate that this technique enhances the precision of frequency tracking, reduces the time required for synchronization establishment, and improves the interference success rate by 27.52% on average compared with existing methods.

  • New
  • Research Article
  • 10.3846/jcem.2026.26154
An intelligent construction system based on digital twin and foundation model optimization
  • Apr 20, 2026
  • Journal of Civil Engineering and Management
  • Fengyi Guo + 4 more

Construction sites routinely face multi-trade concurrency, spatiotemporal coupling, and high safety risk; relying solely on manual inspection and heuristic scheduling often leads to lagging detection and inconsistent execution. In response, recent practice has introduced digital twins (DT) to fuse video, sensors, and BIM and thus improve site visibility; however, most implementations remain at monitoring/visualization, lacking a mechanism to convert cognition into executable, verifiable decisions. Meanwhile, Transformer foundation models show strong capabilities in multimodal perception and representation learning, yet they are rarely coupled in a closed loop with engineering constraints and on-site execution. Against this backdrop, taking high-rise self-climbing platform (SCP) operations as a representative scenario, we build a DT×Transformer closed-loop system. We align video/sensor/BIM/text at the component level via “Component-ID + Timestamp”, train a multimodal Transformer for operation-state recognition and short-horizon risk prediction, and then explicitly encode safety, resource, and spatial precedence constraints in a policy module to generate feasible task sequences, which are delivered to crews via AR with acknowledgments to close the loop. The system integrates multisource perception, digital twin, foundation-model reasoning, and AR-assisted execution, and was validated on a high-rise self-climbing platform project for its overall improvement of construction performance. The evaluation covered four key aspects – safety management, operational efficiency, communication and execution, and information transparency. Results show that the system significantly extends the lead time of risk warnings, reduces violation rates, stabilizes construction rhythm, shortens decision latency, and markedly improves the consistency between instruction delivery and on-site feedback.
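
The abstract above describes aligning video, sensor, BIM, and text records through a shared "Component-ID + Timestamp" key. A minimal sketch of that kind of keyed alignment might look like the following. The record layout, the field names (`source`, `component_id`, `timestamp`, `payload`), the one-second bucket, and the sample values are all assumptions made for illustration, not details taken from the paper.

```python
from collections import defaultdict


def align_records(records, window_s=1.0):
    """Group records from different sources by (component_id, time bucket).

    Records whose timestamps fall in the same window_s-wide bucket and that
    refer to the same component end up in one aligned entry.
    """
    aligned = defaultdict(dict)
    for rec in records:
        bucket = int(rec["timestamp"] // window_s)
        key = (rec["component_id"], bucket)
        # A later record from the same source in the same bucket overwrites
        # the earlier one; a real system would need a merge policy here.
        aligned[key][rec["source"]] = rec["payload"]
    return dict(aligned)


# Hypothetical sample data: identifiers and payloads are placeholders.
records = [
    {"source": "video", "component_id": "SCP-07", "timestamp": 12.2, "payload": "frame_0366"},
    {"source": "sensor", "component_id": "SCP-07", "timestamp": 12.7, "payload": {"tilt_deg": 0.4}},
    {"source": "bim", "component_id": "SCP-07", "timestamp": 12.0, "payload": "element_placeholder"},
    {"source": "sensor", "component_id": "SCP-09", "timestamp": 12.5, "payload": {"tilt_deg": 1.1}},
]
aligned = align_records(records)
for key, sources in sorted(aligned.items()):
    print(key, sorted(sources))
```

Bucketing by a fixed window is only one possible choice; nearest-timestamp matching would be an alternative when sources are sampled at very different rates.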

  • Research Article
  • 10.1038/s41598-026-46558-y
Physics-constrained multimodal vision transformer for ultra-short-term solar radiation forecasting error correction
  • Apr 5, 2026
  • Scientific Reports
  • Ziyao Jiang + 1 more


  • Research Article
  • 10.71465/gmssrj178
Foundation Models Enable Autonomous Collision Avoidance in Congested Orbital Environments
  • Apr 5, 2026
  • Global Media and Social Sciences Research Journal
  • Zhewei Fan + 2 more

The rapid proliferation of resident space objects in low Earth orbit has rendered traditional collision avoidance workflows increasingly inadequate for the scale and operational tempo of modern constellation management. This paper presents OrbiFM, a foundation model (FM)-based framework for autonomous collision avoidance in congested orbital environments. OrbiFM integrates a multi-modal transformer encoder with a physically constrained risk assessment head and an autoregressive maneuver decoder, processing conjunction data messages (CDM), two-line element (TLE)-derived orbital states, and space weather indices within a unified architecture adapted through low-rank adaptation (LoRA) fine-tuning. Simulation experiments across a synthetic catalog of 2,400 low Earth orbit (LEO) objects demonstrate that OrbiFM achieves a mean collision probability prediction error of 3.2%, a false positive maneuver trigger reduction of 12.2% relative to recurrent neural network baselines, and a per-satellite fuel saving of 18.6% over a 90-day evaluation window. Chain-of-thought inference additionally enables human-interpretable decision justification, a critical prerequisite for regulatory trust in autonomous space traffic management systems.
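
The abstract above mentions adaptation through LoRA fine-tuning. As a rough illustration of the general LoRA idea (not of OrbiFM itself), a frozen weight matrix W is combined with a trainable low-rank product, giving an effective weight W + (alpha / r) · B · A. The dimensions, rank, `alpha` value, and helper names below are assumptions chosen for the sketch.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where A is r x d_in and B is d_out x r."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


random.seed(0)
d_out, d_in, r = 8, 8, 2  # toy sizes, assumed for illustration
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen weights
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable
b = [[0.0] * r for _ in range(d_out)]                                   # trainable, zero-initialised

w_eff = lora_effective_weight(w, a, b, alpha=4)

full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print("trainable parameters: full =", full_params, "LoRA =", lora_params)
```

With B initialised to zero, the adapted layer starts out identical to the frozen one, and only A and B would be updated during fine-tuning.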

  • Research Article
  • 10.1016/j.cad.2026.104082
Garment Pattern Accurate Reconstruction from 3D Point Clouds via a Multi-modal Transformer
  • Apr 1, 2026
  • Computer-Aided Design
  • Xiaoyuan Huang + 4 more


  • Research Article
  • 10.1016/j.jvir.2026.108321
Abstract No. 294 Multimodal Vision Transformer Modeling of Survival and Transplant Eligibility Following Radioembolization for Hepatocellular Carcinoma
  • Apr 1, 2026
  • Journal of Vascular and Interventional Radiology
  • T Mehta + 4 more


  • Research Article
  • Cited by 1
  • 10.1016/j.icte.2025.07.005
Transformer guided Multimodal VQA model for Fruit recognitions
  • Apr 1, 2026
  • ICT Express
  • Dat Tran + 1 more


  • Research Article
  • 10.1038/s41598-026-45928-w
Predictive analysis of student engagement in university physical education courses based on a multimodal transformer algorithm.
  • Mar 26, 2026
  • Scientific Reports
  • Jianping Li

Student engagement is a critical factor influencing teaching effectiveness in university physical education courses. To address common issues such as low attendance and insufficient classroom interaction in elective physical education courses, this study proposes an automated student engagement prediction model based on a multimodal Transformer algorithm. The model first utilizes the University Student Sports and Physical Health Dataset (https://www.ncmi.cn/phda/dataDetails.do?id=CSTR:17970.11.A0032.202412.278.V1.0) as its data source. After preprocessing, multimodal data are filtered and divided into a training set (80%) and a testing set (20%). Feature extraction is then performed on the multimodal data: a One-Dimensional Convolutional Neural Network (1D CNN) combined with Long Short-Term Memory (LSTM) processes sensor data, Bidirectional Encoder Representations from Transformers extracts text features, and Vision Transformer encodes video segments. Next, a hierarchical cross-modal Transformer architecture is designed. This architecture enhances single-modal feature representation through intra-modal self-attention and dynamically aligns heterogeneous data (e.g., the correlation between heart rate changes and "fatigue" text descriptions) using a cross-modal attention mechanism to achieve multimodal interaction. Finally, after fusing the cross-modal features, a fully connected layer outputs the student engagement prediction results. Performance analysis based on the specified data source reveals that the proposed model reduces the mean absolute error by 22.3% in the engagement regression task compared to the single-modal baseline (1D CNN+LSTM), and the F1-score for student engagement prediction increases to 0.81. Ablation experiments confirm the necessity of multimodal fusion; the proposed model achieves over 90% accuracy in student engagement prediction, whereas prediction performance decreases by 17%-35% when only a single modality is used. Furthermore, in terms of operational efficiency, the model can complete engagement prediction for a single class session (a 10-minute data window) within 0.2s, representing a 40% improvement in evaluation efficiency compared to baseline algorithms, thus meeting real-time classroom monitoring requirements. Therefore, this study significantly enhances the accuracy and real-time capability of student engagement prediction. Its interpretable cross-modal correlation analysis provides an intelligent decision-making basis for optimizing physical education teaching and offers a reference for advancing educational assessment from experience-driven to data-driven approaches.
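
The cross-modal attention mechanism described in the abstract above can be illustrated with plain scaled dot-product attention in which queries come from one modality and keys and values from another. The sketch below is a generic toy version, not the paper's architecture: it omits learned projections and multiple heads, and the two-dimensional embeddings are made-up values.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one modality and
    keys/values from another. Returns (outputs, attention_weights)."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# Toy embeddings (assumed): 2 text tokens attend over 3 sensor time steps.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
sensor_steps = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

fused, attn = cross_attention(text_tokens, sensor_steps, sensor_steps)
for row in attn:
    print([round(x, 3) for x in row])
```

Each row of `attn` shows how strongly one text token weights each sensor step, which is the kind of cross-modal correlation the abstract refers to when pairing heart-rate changes with "fatigue" descriptions.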

  • Research Article
  • 10.1016/j.compbiolchem.2026.109010
TransDTAP: A multimodal transformer architecture for drug-target affinity prediction using sequence and biochemical properties.
  • Mar 13, 2026
  • Computational Biology and Chemistry
  • Abdelkader Bouguessa + 2 more


  • Research Article
  • 10.1007/s10489-026-07178-1
MIT-CA: Multi-modal interaction transformer with cross-attention for malware classification
  • Mar 9, 2026
  • Applied Intelligence
  • Meng Zhao + 6 more


  • Research Article
  • 10.1038/s41598-026-43616-3
MM FD ConvFormer multimodal frequency aware deformable CNN transformer network for robust brain tumor classification.
  • Mar 9, 2026
  • Scientific Reports
  • Anto Lourdu Xavier Raj Arockia Selvarathinam + 6 more

Accurate brain tumor classification from magnetic resonance imaging (MRI) is critical for early diagnosis and effective clinical decision-making. Although recent CNN-Transformer hybrid models have shown promising performance, most approaches rely primarily on single-modal spatial information, limiting their ability to capture complementary spectral features, model tumor heterogeneity, and generalize across datasets. To address these challenges, this paper proposes MM-FD-ConvFormer, a multimodal frequency-aware deformable CNN-Transformer network for robust brain tumor classification with enhanced interpretability. The proposed model integrates three complementary modalities: (1) spatial MRI representations from original images, (2) frequency-domain MRI representations obtained via Fourier or wavelet transforms to capture texture and intensity variations, and (3) multi-scale contextual features for modeling global dependencies. A ConvNeXt V2 backbone is employed to extract discriminative spatial features, while a parallel lightweight ConvNeXt-based branch processes frequency-domain inputs. These features are subsequently fused and refined using a Swin Transformer V2 to capture long-range contextual relationships. To effectively integrate heterogeneous modalities and adapt to irregular tumor boundaries, a deformable cross-modal attention mechanism is introduced, enabling dynamic and shape-aware feature fusion. Final classification is performed on a unified multimodal representation, with an optional uncertainty-aware prediction head to improve reliability. The proposed model is evaluated using multiple public datasets, including the Kaggle Brain Tumor MRI and Figshare datasets for training, with external validation on the clinically relevant BraTS 2020/2021 dataset and optional testing on TCIA/REMBRANDT to assess cross-dataset generalization. Extensive experiments demonstrate that MM-FD-ConvFormer consistently outperforms standard CNN baselines, advanced transformer-based models, and hybrid approaches in terms of accuracy, macro-F1 score, and AUC. Furthermore, qualitative analyses using Grad-CAM, attention map visualization, and weakly supervised pseudo-segmentation provide interpretable insights into tumor localization and model decision-making. Overall, MM-FD-ConvFormer offers a robust, interpretable, and generalizable solution for automated brain tumor classification in real-world clinical imaging applications.
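
The abstract above mentions frequency-domain representations obtained via Fourier or wavelet transforms. As a small, generic illustration of the Fourier option only, the sketch below computes the magnitude of a 2D discrete Fourier transform for a tiny striped patch. The patch values and the function name are assumptions; a real pipeline would use an FFT library on full images rather than this quadruple loop.

```python
import cmath


def dft2_magnitude(image):
    """Magnitude of the 2D discrete Fourier transform of a small image.

    Direct O(N^4) evaluation, intended only for tiny illustrative inputs.
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            acc = 0j
            for x in range(rows):
                for y in range(cols):
                    angle = -2 * cmath.pi * (u * x / rows + v * y / cols)
                    acc += image[x][y] * cmath.exp(1j * angle)
            out[u][v] = abs(acc)
    return out


# Toy 4x4 patch with vertical stripes (period of two pixels along each row).
patch = [
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
]
spectrum = dft2_magnitude(patch)
for row in spectrum:
    print([round(value, 3) for value in row])
```

The striped texture concentrates its energy in two coefficients (the mean and the stripe frequency), which is the kind of texture cue a frequency-domain branch could pick up more directly than a spatial one.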

  • Research Article
  • 10.1038/s41598-026-43351-9
Hypergraph-based contrastive embedding and attention fusion for detection of skin cancer.
  • Mar 9, 2026
  • Scientific Reports
  • Tathagat Banerjee + 5 more

Skin diseases involve a spectrum of problems including infections and malignancies. Melanoma, the deadliest kind of skin cancer, starts in melanocytes, which make melanin. Early detection is important, but it is difficult because the visual indications are often subtle and diagnostic datasets are heavily class-imbalanced. The proposed C2G-HFMTA framework consists of three hierarchical levels: (a) an overall contrastive learning (CL) framework, (b) two major feature learning branches, namely the Graph Contrastive Embedding Framework (GCEF) and the High-dimensional Feature with Multimodal Transformer Attention (HFMTA), and (c) attention and fusion sub-modules including Hypergraph Bi-Convolutional Attention and Multiscale Transformer Attention, which operate within these branches to enhance discriminative representation learning. The proposed method demonstrates strong performance on benchmark dermoscopic datasets and may support future computer-aided diagnosis systems, subject to further validation and real-world testing. We have used Clustered Class-Based Segmentation (CCBS) for changing the training distributions. Our Class-Based Contrastive Loss (CBCL) works directly on original dermoscopic images, which preserves the semantic integrity of the images while making it easier to tell the difference between classes. Our framework outperforms several recent CNN- and transformer-based baselines in controlled experimental settings. It achieves 93.2% accuracy and a 92.9% F1-score, and it does well on minority classes. Experiments were conducted on the HAM10000 dataset containing 10,015 dermoscopic images across seven diagnostic categories, using a stratified train–validation–test split of 70%–10%–20%. Performance was evaluated using accuracy, precision, recall, and F1-score, using five-fold stratified cross-validation to ensure robust performance estimation. Ablation experiments show that grouping, cross-branch fusion, and semantic-guided attention are important.
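
The stratified 70%–10%–20% split mentioned in the abstract above keeps each class's share roughly equal across the three subsets, which matters when classes are imbalanced. A minimal sketch of one way to do this follows. The class names, class sizes, seed, and rounding rule are assumptions for illustration and do not reflect the actual HAM10000 categories or the authors' procedure.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split sample indices into train/val/test, class by class.

    The last subset receives whatever remains after rounding, so every
    index is assigned exactly once.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n = len(indices)
        n_train = round(n * fractions[0])
        n_val = round(n * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test


# Hypothetical imbalanced toy labels (not the real dataset distribution).
labels = ["common"] * 100 + ["rare"] * 20 + ["very_rare"] * 10
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting per class rather than over the pooled data guarantees that even the smallest class appears in the validation and test subsets.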

  • Research Article
  • 10.38094/jastt71658
Integrating Vision and Language: An Improved VAD Model
  • Mar 3, 2026
  • Journal of Applied Science and Technology Trends
  • Manas Ranjan Biswal + 1 more

Automatic anomaly detection in video surveillance is crucial for public and private safety. However, it is challenging because of unclear abnormal events, limited labeled data, and mismatches between different types of data. Traditional video anomaly detection methods mainly focus on spatiotemporal visual features. They often ignore semantic information and interactions between different data types. Additionally, many multimodal approaches use basic fusion methods that do not solve the alignment problems between these types of data. To address these issues, we propose a multimodal framework that includes a Hierarchical Multi-scale Temporal Network (H-MSTN). This network models short-, medium-, and long-term dependencies in visual and textual data. A lightweight cross-modal attention module makes sure the semantics align. Meanwhile, a Multimodal Attention-Based Fusion Transformer (MAFT) refines cross-modal representations in real time. We evaluate this framework using the UCF-Crime and XD-Violence benchmarks. The proposed method achieves 92.42% AUC on UCF-Crime and 88.63% AP on XD-Violence with significantly lower computational cost and faster inference than recent multimodal baselines such as ReFLIP-VAD. These results demonstrate a strong efficiency–accuracy trade-off for real-time deployment while maintaining competitive or improved performance over prior methods such as MVAD and TEVAD.

  • Research Article
  • 10.3390/computers15030161
From Patient Emotion Recognition to Provider Understanding: A Multimodal Data Mining Framework for Emotion-Aware Clinical Counseling Systems
  • Mar 3, 2026
  • Computers
  • Saahithi Mallarapu + 6 more

Computational analysis of therapeutic communication presents challenges in multi-label classification, severe class imbalance, and heterogeneous multimodal data integration. We introduce a bidirectional analytical framework addressing patient emotion recognition and provider behavior analysis. For patient-side analysis, we employ ClinicalBERT on human-annotated CounselChat (1482 interactions, 25 categories, imbalance 60:1), achieving a macro-F1 of 0.74 through class weighting and threshold optimization, representing a six-fold improvement over naive baselines and 6–13 point improvement over modern imbalance methods. For provider-side analysis, we process 330 YouTube therapy sessions through automated pipelines (speaker diarization, automatic speech recognition, temporal segmentation), yielding 14,086 annotated segments. Our architecture combines DeBERTa-v3-base with WavLM-base-plus through cross-modal attention mechanisms adapted from multimodal Transformer frameworks. On controlled human-annotated HOPE data (178 sessions, 12,500 utterances), the model achieves a macro-F1 of 0.91 with Cohen’s kappa of 0.87, comparable to inter-rater reliability reported in psychotherapy process research. On YouTube data, a macro-F1 of 0.71 demonstrates feasibility while highlighting annotation quality impacts. Cross-dataset transfer and systematic attention analyses validate domain-specific effectiveness and interpretability.
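
The abstract above reports macro-F1 improved through class weighting and threshold optimization. The sketch below illustrates only the threshold part in a generic multi-label setting: each label gets its own decision threshold chosen from a small grid, and macro-F1 is the unweighted mean of per-label F1. The scores, labels, and grid are invented for the example and are not taken from the paper.

```python
def f1(y_true, y_pred):
    """Binary F1 from parallel lists of truthy/falsy values."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores, grid=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Pick the grid threshold with the highest F1 (first one wins on ties)."""
    return max(grid, key=lambda t: f1(y_true, [s >= t for s in scores]))


def macro_f1(per_label_true, per_label_scores, thresholds):
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    f1s = [
        f1(yt, [s >= th for s in sc])
        for yt, sc, th in zip(per_label_true, per_label_scores, thresholds)
    ]
    return sum(f1s) / len(f1s)


# Toy data (assumed): label 0 is common, label 1 is rare with low scores.
y_true = [[1, 0, 0, 1, 0, 0], [0, 0, 1, 0, 0, 0]]
scores = [[0.9, 0.2, 0.4, 0.7, 0.1, 0.3], [0.2, 0.1, 0.35, 0.25, 0.05, 0.1]]

tuned = [best_threshold(yt, sc) for yt, sc in zip(y_true, scores)]
print("tuned thresholds:", tuned)
print("macro-F1 at 0.5:", macro_f1(y_true, scores, [0.5, 0.5]))
print("macro-F1 tuned:", macro_f1(y_true, scores, tuned))
```

In practice the thresholds would be tuned on a held-out validation set; tuning and scoring on the same samples, as this toy does, gives an optimistic estimate.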

  • Research Article
  • 10.1016/j.bspc.2025.109108
CTNet: multi-modal channel attention transformer network for breast cancer image classification
  • Mar 1, 2026
  • Biomedical Signal Processing and Control
  • Muhammad Mumtaz Ali + 7 more


  • Research Article
  • 10.1016/j.measurement.2025.120151
Multimodal attention transformer for acoustic-seismic signal fusion target recognition
  • Mar 1, 2026
  • Measurement
  • Kangcheng Bin + 1 more


  • Research Article
  • 10.1016/j.jvcir.2026.104736
Multimodal prompt-guided vision transformer for precise image manipulation localization
  • Mar 1, 2026
  • Journal of Visual Communication and Image Representation
  • Yafang Xiao + 5 more


  • Research Article
  • 10.1007/s11760-026-05233-5
Towards explainable AI: multi-modal transformer for video-based image description generation
  • Mar 1, 2026
  • Signal, Image and Video Processing
  • Lakshita Agarwal + 1 more


  • Research Article
  • Cited by 1
  • 10.1016/j.bspc.2025.109039
Multimodal transformer for depression detection based on EEG and interview data
  • Mar 1, 2026
  • Biomedical Signal Processing and Control
  • Nima Esmi + 3 more

Depression detection benefits from combining neurological and behavioral indicators, yet integrating heterogeneous modalities such as EEG and interview audio remains challenging. We propose a transformer-based multimodal framework that jointly models spectral, spatial, and temporal EEG features alongside linguistic and paralinguistic cues from interviews. By employing synchronized multi-head cross-attention and self-attention mechanisms, the model effectively captures intra- and inter-modal correlations. In addition, a flexible temporal sequence matching strategy reduces EEG channel requirements, enhancing device portability. Evaluated on the MODMA and DAIC-WOZ datasets, our approach achieves superior performance compared to state-of-the-art models, with a 4.7% improvement in accuracy and a 10% increase in precision. These results demonstrate the potential of the proposed framework for accurate, scalable, and cost-effective depression detection in both clinical and real-world settings.

