Articles published on Multimodal Transformer
471 Search results
- New
- Research Article
- 10.1016/j.media.2026.103966
- May 1, 2026
- Medical Image Analysis
- Jiahao Xu + 5 more
Multimodal sparse fusion transformer network with spatio-temporal decoupling for breast tumor classification.
- New
- Research Article
- 10.3390/sym18050723
- Apr 24, 2026
- Symmetry
- Chengkai Tang + 4 more
Low-Earth orbit satellites are gradually becoming the core infrastructure of integrated aerospace communication networks, with their significant advantages of high communication rates, small transmission delay, and wide coverage. Interference with military communications in response to their security and protection needs is a current research challenge. Consequently, this paper introduces an interference technique optimized for low-Earth orbit satellite signals using a multimodal learning transformer model (OI-MLT). The proposed method incorporates symmetry-aware design by exploiting the inherent time–frequency structural characteristics of LEO satellite signals and the spatially distributed topology of interference sources. An optimized model for distributed interference sources is developed, and multimodal information of spectra and numerical values is processed in parallel through the self-attention mechanism. This approach effectively addresses the problem of dynamic matching between the interference signal and target signal in high-speed LEO scenarios, as well as high-precision interference synchronization under time-varying channels. Experimental results demonstrate that this technique enhances the precision of frequency tracking, reduces the time required for synchronization establishment, and improves the interference success rate by 27.52% on average compared with existing methods.
- New
- Research Article
- 10.3846/jcem.2026.26154
- Apr 20, 2026
- Journal of Civil Engineering and Management
- Fengyi Guo + 4 more
Construction sites routinely face multi-trade concurrency, spatiotemporal coupling, and high safety risk; relying solely on manual inspection and heuristic scheduling often leads to lagging detection and inconsistent execution. In response, recent practice has introduced digital twins (DT) to fuse video, sensors, and BIM and thus improve site visibility; however, most implementations remain at monitoring/visualization, lacking a mechanism to convert cognition into executable, verifiable decisions. Meanwhile, Transformer foundation models show strong capabilities in multimodal perception and representation learning, yet they are rarely integrated in a closed loop with engineering constraints and on-site execution. Against this backdrop, taking high-rise self-climbing platform (SCP) operations as a representative scenario, we build a DT×Transformer closed-loop system. We align video/sensor/BIM/text at the component level via “Component-ID + Timestamp”, train a multimodal Transformer for operation-state recognition and short-horizon risk prediction, and then explicitly encode safety, resource, and spatial precedence constraints in a policy module to generate feasible task sequences, which are delivered to crews via AR with acknowledgments to close the loop. The system integrates multisource perception, digital twin, foundation-model reasoning, and AR-assisted execution, and its overall effect on construction performance was validated on a high-rise self-climbing platform project. The evaluation covered four key aspects: safety management, operational efficiency, communication and execution, and information transparency. Results show that the system significantly extends the lead time of risk warnings, reduces violation rates, stabilizes construction rhythm, shortens decision latency, and markedly improves the consistency between instruction delivery and on-site feedback.
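The idea of encoding precedence constraints to obtain a feasible task sequence can be illustrated with a topological sort. This is only a minimal sketch: the task names and the `feasible_sequence` helper are hypothetical, it covers precedence constraints only (not safety or resource constraints), and it is not the policy module described in the abstract.

```python
# Sketch: derive a feasible task order from precedence constraints.
# Task names are hypothetical; the abstract does not list the actual SCP tasks.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks that must finish before it starts.
precedence = {
    "inspect_anchors": set(),
    "release_locks": {"inspect_anchors"},
    "climb_platform": {"release_locks"},
    "re_engage_locks": {"climb_platform"},
    "resume_formwork": {"re_engage_locks", "inspect_anchors"},
}

def feasible_sequence(graph):
    """Return one task order that respects every precedence constraint.

    TopologicalSorter raises graphlib.CycleError if the constraints
    contradict each other (for example, A before B and B before A).
    """
    return list(TopologicalSorter(graph).static_order())

sequence = feasible_sequence(precedence)
print(sequence)
```

A contradiction in the constraint set surfaces as an exception rather than a silently invalid plan, which is one reason to make the constraints explicit.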
- Research Article
- 10.1038/s41598-026-46558-y
- Apr 5, 2026
- Scientific Reports
- Ziyao Jiang + 1 more
Physics-constrained multimodal vision transformer for ultra-short-term solar radiation forecasting error correction
- Research Article
- 10.71465/gmssrj178
- Apr 5, 2026
- Global Media and Social Sciences Research Journal
- Zhewei Fan + 2 more
The rapid proliferation of resident space objects in low Earth orbit has rendered traditional collision avoidance workflows increasingly inadequate for the scale and operational tempo of modern constellation management. This paper presents OrbiFM, a foundation model (FM)-based framework for autonomous collision avoidance in congested orbital environments. OrbiFM integrates a multi-modal transformer encoder with a physically constrained risk assessment head and an autoregressive maneuver decoder, processing conjunction data messages (CDM), two-line element (TLE)-derived orbital states, and space weather indices within a unified architecture adapted through low-rank adaptation (LoRA) fine-tuning. Simulation experiments across a synthetic catalog of 2,400 low Earth orbit (LEO) objects demonstrate that OrbiFM achieves a mean collision probability prediction error of 3.2%, a false positive maneuver trigger reduction of 12.2% relative to recurrent neural network baselines, and a per-satellite fuel saving of 18.6% over a 90-day evaluation window. Chain-of-thought inference additionally enables human-interpretable decision justification, a critical prerequisite for regulatory trust in autonomous space traffic management systems.
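The LoRA fine-tuning mentioned in this abstract can be illustrated with a small sketch. LoRA keeps a pretrained weight matrix W frozen and learns a low-rank update, so the layer output becomes W x + (alpha / r) B A x, with B starting at zero. The shapes, values, and function names below are illustrative assumptions and are not taken from OrbiFM.

```python
# Minimal sketch of a low-rank adaptation (LoRA) forward pass in pure Python.
# All shapes, values, and names are illustrative assumptions, not OrbiFM's.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """Return W x + (alpha / r) * B (A x), where r is the adapter rank.

    W: frozen pretrained weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]          # frozen 2x3 weight
A = [[0.5, 0.5, 0.5]]          # rank r = 1
B_zero = [[0.0], [0.0]]        # B is initialized to zero in LoRA
B_trained = [[1.0], [-1.0]]    # hypothetical values after fine-tuning
x = [1.0, 2.0, 3.0]

y_init = lora_forward(W, A, B_zero, x, alpha=2.0)      # equals W x
y_tuned = lora_forward(W, A, B_trained, x, alpha=2.0)  # W x plus the update
print(y_init, y_tuned)
```

Because B starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and only the small A and B matrices are trained.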
- Research Article
- 10.1016/j.cad.2026.104082
- Apr 1, 2026
- Computer-Aided Design
- Xiaoyuan Huang + 4 more
Garment Pattern Accurate Reconstruction from 3D Point Clouds via a Multi-modal Transformer
- Research Article
- 10.1016/j.jvir.2026.108321
- Apr 1, 2026
- Journal of Vascular and Interventional Radiology
- T Mehta + 4 more
Abstract No. 294 Multimodal Vision Transformer Modeling of Survival and Transplant Eligibility Following Radioembolization for Hepatocellular Carcinoma
- Research Article
- 10.1016/j.icte.2025.07.005
- Apr 1, 2026
- ICT Express
- Dat Tran + 1 more
Transformer guided Multimodal VQA model for Fruit recognitions
- Research Article
- 10.1038/s41598-026-45928-w
- Mar 26, 2026
- Scientific Reports
- Jianping Li
Student engagement is a critical factor influencing teaching effectiveness in university physical education courses. To address common issues such as low attendance and insufficient classroom interaction in elective physical education courses, this study proposes an automated student engagement prediction model based on a multimodal Transformer algorithm. The model first utilizes the University Student Sports and Physical Health Dataset (https://www.ncmi.cn/phda/dataDetails.do?id=CSTR:17970.11.A0032.202412.278.V1.0) as its data source. After preprocessing, multimodal data are filtered and divided into a training set (80%) and a testing set (20%). Feature extraction is then performed on the multimodal data: a One-Dimensional Convolutional Neural Network (1D CNN) combined with Long Short-Term Memory (LSTM) processes sensor data, Bidirectional Encoder Representations from Transformers extracts text features, and Vision Transformer encodes video segments. Next, a hierarchical cross-modal Transformer architecture is designed. This architecture enhances single-modal feature representation through intra-modal self-attention and dynamically aligns heterogeneous data (e.g., the correlation between heart rate changes and "fatigue" text descriptions) using a cross-modal attention mechanism to achieve multimodal interaction. Finally, after fusing the cross-modal features, a fully connected layer outputs the student engagement prediction results. Performance analysis based on the specified data source reveals that the proposed model reduces the mean absolute error by 22.3% in the engagement regression task compared to the single-modal baseline (1D CNN+LSTM), and the F1-score for student engagement prediction increases to 0.81. Ablation experiments confirm the necessity of multimodal fusion; the proposed model achieves over 90% accuracy in student engagement prediction, whereas prediction performance decreases by 17%-35% when only a single modality is used. 
Furthermore, in terms of operational efficiency, the model can complete engagement prediction for a single class session (a 10-minute data window) within 0.2s, representing a 40% improvement in evaluation efficiency compared to baseline algorithms, thus meeting real-time classroom monitoring requirements. Therefore, this study significantly enhances the accuracy and real-time capability of student engagement prediction. Its interpretable cross-modal correlation analysis provides an intelligent decision-making basis for optimizing physical education teaching and offers a reference for advancing educational assessment from experience-driven to data-driven approaches.
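The cross-modal attention step described in this abstract can be sketched as scaled dot-product attention in which tokens of one modality act as queries over the tokens of another. The sketch below is a simplified assumption: it omits the learned query/key/value projections and multiple heads, and the token values and names are hypothetical, not the paper's.

```python
# Sketch of cross-modal scaled dot-product attention in pure Python.
# Learned projections and multi-head splitting are omitted for brevity;
# token values and names are hypothetical.
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(queries, keys, values):
    """Each query token (modality A) attends over key/value tokens (modality B).

    Returns the attended outputs and the attention weight matrix.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        context = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights

text_tokens = [[1.0, 0.0], [0.0, 1.0]]                  # e.g. text features
sensor_tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]    # e.g. sensor features
fused, attn = cross_modal_attention(text_tokens, sensor_tokens, sensor_tokens)
print(attn)
```

Each row of `attn` shows how strongly one text token draws on each sensor token, which is the kind of weight matrix that cross-modal correlation analyses typically inspect.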
- Research Article
- 10.1016/j.compbiolchem.2026.109010
- Mar 13, 2026
- Computational Biology and Chemistry
- Abdelkader Bouguessa + 2 more
TransDTAP: A multimodal transformer architecture for drug-target affinity prediction using sequence and biochemical properties.
- Research Article
- 10.1007/s10489-026-07178-1
- Mar 9, 2026
- Applied Intelligence
- Meng Zhao + 6 more
MIT-CA: Multi-modal interaction transformer with cross-attention for malware classification
- Research Article
- 10.1038/s41598-026-43616-3
- Mar 9, 2026
- Scientific Reports
- Anto Lourdu Xavier Raj Arockia Selvarathinam + 6 more
Accurate brain tumor classification from magnetic resonance imaging (MRI) is critical for early diagnosis and effective clinical decision-making. Although recent CNN-Transformer hybrid models have shown promising performance, most approaches rely primarily on single-modal spatial information, limiting their ability to capture complementary spectral features, model tumor heterogeneity, and generalize across datasets. To address these challenges, this paper proposes MM-FD-ConvFormer, a multimodal frequency-aware deformable CNN-Transformer network for robust brain tumor classification with enhanced interpretability. The proposed model integrates three complementary modalities: (1) spatial MRI representations from original images, (2) frequency-domain MRI representations obtained via Fourier or wavelet transforms to capture texture and intensity variations, and (3) multi-scale contextual features for modeling global dependencies. A ConvNeXt V2 backbone is employed to extract discriminative spatial features, while a parallel lightweight ConvNeXt-based branch processes frequency-domain inputs. These features are subsequently fused and refined using a Swin Transformer V2 to capture long-range contextual relationships. To effectively integrate heterogeneous modalities and adapt to irregular tumor boundaries, a deformable cross-modal attention mechanism is introduced, enabling dynamic and shape-aware feature fusion. Final classification is performed on a unified multimodal representation, with an optional uncertainty-aware prediction head to improve reliability. The proposed model is evaluated using multiple public datasets, including the Kaggle Brain Tumor MRI and Figshare datasets for training, with external validation on the clinically relevant BraTS 2020/2021 dataset and optional testing on TCIA/REMBRANDT to assess cross-dataset generalization.
Extensive experiments demonstrate that MM-FD-ConvFormer consistently outperforms standard CNN baselines, advanced transformer-based models, and hybrid approaches in terms of accuracy, macro-F1 score, and AUC. Furthermore, qualitative analyses using Grad-CAM, attention map visualization, and weakly supervised pseudo-segmentation provide interpretable insights into tumor localization and model decision-making. Overall, MM-FD-ConvFormer offers a robust, interpretable, and generalizable solution for automated brain tumor classification in real-world clinical imaging applications.
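The frequency-domain representation mentioned in this abstract can be illustrated with a 2D discrete Fourier transform of a tiny image. This is a naive pure-Python sketch for clarity, assuming the Fourier option (not the wavelet one); a real pipeline would use an FFT library, and the 4x4 test image and function name are hypothetical.

```python
# Sketch: magnitude spectrum of a small grayscale image via a naive 2D DFT.
# For illustration only; real pipelines would use an optimized FFT.
import cmath

def dft2_magnitude(image):
    """Return |F(u, v)| for every frequency pair of a 2D list of floats."""
    rows, cols = len(image), len(image[0])
    spectrum = []
    for u in range(rows):
        spectrum_row = []
        for v in range(cols):
            total = 0j
            for y in range(rows):
                for x in range(cols):
                    angle = -2 * cmath.pi * (u * y / rows + v * x / cols)
                    total += image[y][x] * cmath.exp(1j * angle)
            spectrum_row.append(abs(total))
        spectrum.append(spectrum_row)
    return spectrum

# Hypothetical 4x4 image with vertical stripes (columns alternate 0, 1).
striped = [[float(x % 2) for x in range(4)] for _ in range(4)]
spec = dft2_magnitude(striped)

# The stripes repeat every 2 pixels horizontally, so besides the DC term
# spec[0][0] the energy concentrates at horizontal frequency index v = 2.
print(round(spec[0][0], 6), round(spec[0][2], 6), round(spec[1][0], 6))
```

Texture that is spread across many pixels in the spatial image collapses into a few coefficients here, which is why a frequency branch can complement a spatial one.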
- Research Article
- 10.1038/s41598-026-43351-9
- Mar 9, 2026
- Scientific Reports
- Tathagat Banerjee + 5 more
Skin diseases span a spectrum of conditions, including infections and malignancies. Melanoma, the deadliest form of skin cancer, originates in melanocytes, the cells that produce melanin. Early detection is essential but difficult, because the visual indications are often subtle and diagnostic datasets are heavily class-imbalanced. The proposed C2G-HFMTA framework consists of three hierarchical levels: (a) an overall contrastive learning (CL) framework, (b) two major feature learning branches, namely the Graph Contrastive Embedding Framework (GCEF) and the High-dimensional Feature with Multimodal Transformer Attention (HFMTA), and (c) attention and fusion sub-modules including Hypergraph Bi-Convolutional Attention and Multiscale Transformer Attention, which operate within these branches to enhance discriminative representation learning. The proposed method demonstrates strong performance on benchmark dermoscopic datasets and may support future computer-aided diagnosis systems, subject to further validation and real-world testing. Clustered Class-Based Segmentation (CCBS) is used to adjust the training distributions. The Class-Based Contrastive Loss (CBCL) operates directly on the original dermoscopic images, which preserves their semantic integrity while improving class separability. The framework outperforms several recent CNN- and transformer-based baselines in controlled experimental settings, achieving 93.2% accuracy and a 92.9% F1-score, and it performs well on minority classes. Experiments were conducted on the HAM10000 dataset containing 10,015 dermoscopic images across seven diagnostic categories, using a stratified train–validation–test split of 70%–10%–20%. Performance was evaluated using accuracy, precision, recall, and F1-score, using five-fold stratified cross-validation to ensure robust performance estimation.
Ablation experiments show that grouping, cross-branch fusion, and semantic-guided attention are important.
- Research Article
- 10.38094/jastt71658
- Mar 3, 2026
- Journal of Applied Science and Technology Trends
- Manas Ranjan Biswal + 1 more
Automatic anomaly detection in video surveillance is crucial for public and private safety. However, it is challenging because of unclear abnormal events, limited labeled data, and mismatches between different types of data. Traditional video anomaly detection methods mainly focus on spatiotemporal visual features. They often ignore semantic information and interactions between different data types. Additionally, many multimodal approaches use basic fusion methods that do not solve the alignment problems between these types of data. To address these issues, we propose a multimodal framework that includes a Hierarchical Multi-scale Temporal Network (H-MSTN). This network models short-, medium-, and long-term dependencies in visual and textual data. A lightweight cross-modal attention module makes sure the semantics align. Meanwhile, a Multimodal Attention-Based Fusion Transformer (MAFT) refines cross-modal representations in real time. We evaluate this framework using the UCF-Crime and XD-Violence benchmarks. The proposed method achieves 92.42% AUC on UCF-Crime and 88.63% AP on XD-Violence with significantly lower computational cost and faster inference than recent multimodal baselines such as ReFLIP-VAD. These results demonstrate a strong efficiency–accuracy trade-off for real-time deployment while maintaining competitive or improved performance over prior methods such as MVAD and TEVAD.
- Research Article
- 10.3390/computers15030161
- Mar 3, 2026
- Computers
- Saahithi Mallarapu + 6 more
Computational analysis of therapeutic communication presents challenges in multi-label classification, severe class imbalance, and heterogeneous multimodal data integration. We introduce a bidirectional analytical framework addressing patient emotion recognition and provider behavior analysis. For patient-side analysis, we employ ClinicalBERT on human-annotated CounselChat (1482 interactions, 25 categories, imbalance 60:1), achieving a macro-F1 of 0.74 through class weighting and threshold optimization, representing a six-fold improvement over naive baselines and 6–13 point improvement over modern imbalance methods. For provider-side analysis, we process 330 YouTube therapy sessions through automated pipelines (speaker diarization, automatic speech recognition, temporal segmentation), yielding 14,086 annotated segments. Our architecture combines DeBERTa-v3-base with WavLM-base-plus through cross-modal attention mechanisms adapted from multimodal Transformer frameworks. On controlled human-annotated HOPE data (178 sessions, 12,500 utterances), the model achieves a macro-F1 of 0.91 with Cohen’s kappa of 0.87, comparable to inter-rater reliability reported in psychotherapy process research. On YouTube data, a macro-F1 of 0.71 demonstrates feasibility while highlighting annotation quality impacts. Cross-dataset transfer and systematic attention analyses validate domain-specific effectiveness and interpretability.
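The class weighting and threshold optimization mentioned in this abstract can be sketched as follows. This is a generic illustration under stated assumptions: the inverse-frequency weighting formula is a common heuristic rather than the paper's confirmed choice, and the label names, probabilities, and helper functions are hypothetical.

```python
# Sketch: inverse-frequency class weights and per-label threshold tuning.
# Labels, scores, and helper names are hypothetical illustrations.

def inverse_frequency_weights(label_counts):
    """Weight each class by total / (n_classes * count); rare classes weigh more."""
    total = sum(label_counts.values())
    n_classes = len(label_counts)
    return {label: total / (n_classes * count)
            for label, count in label_counts.items()}

def f1_score(y_true, y_pred):
    """Binary F1 for one label; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, probs, candidates):
    """Pick the candidate threshold that maximizes F1 on held-out data."""
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1

# A 60:1 imbalance between two hypothetical categories.
weights = inverse_frequency_weights({"frequent_label": 60, "rare_label": 1})

# Validation scores for one hypothetical label.
y_true = [1, 1, 0, 0, 1]
probs = [0.35, 0.8, 0.25, 0.1, 0.45]
threshold, f1 = best_threshold(y_true, probs, [0.2, 0.3, 0.4, 0.5])
print(weights, threshold, f1)
```

In a multi-label setting the threshold search is repeated independently for each label, and macro-F1 is the unweighted mean of the per-label F1 scores.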
- Research Article
- 10.1016/j.bspc.2025.109108
- Mar 1, 2026
- Biomedical Signal Processing and Control
- Muhammad Mumtaz Ali + 7 more
CTNet: multi-modal channel attention transformer network for breast cancer image classification
- Research Article
- 10.1016/j.measurement.2025.120151
- Mar 1, 2026
- Measurement
- Kangcheng Bin + 1 more
Multimodal attention transformer for acoustic-seismic signal fusion target recognition
- Research Article
- 10.1016/j.jvcir.2026.104736
- Mar 1, 2026
- Journal of Visual Communication and Image Representation
- Yafang Xiao + 5 more
Multimodal prompt-guided vision transformer for precise image manipulation localization
- Research Article
- 10.1007/s11760-026-05233-5
- Mar 1, 2026
- Signal, Image and Video Processing
- Lakshita Agarwal + 1 more
Towards explainable AI: multi-modal transformer for video-based image description generation
- Research Article
- 10.1016/j.bspc.2025.109039
- Mar 1, 2026
- Biomedical Signal Processing and Control
- Nima Esmi + 3 more
Depression detection benefits from combining neurological and behavioral indicators, yet integrating heterogeneous modalities such as EEG and interview audio remains challenging. We propose a transformer-based multimodal framework that jointly models spectral, spatial, and temporal EEG features alongside linguistic and paralinguistic cues from interviews. By employing synchronized multi-head cross-attention and self-attention mechanisms, the model effectively captures intra- and inter-modal correlations. In addition, a flexible temporal sequence matching strategy reduces EEG channel requirements, enhancing device portability. Evaluated on the MODMA and DAIC-WOZ datasets, our approach achieves superior performance compared to state-of-the-art models, with a 4.7% improvement in accuracy and a 10% increase in precision. These results demonstrate the potential of the proposed framework for accurate, scalable, and cost-effective depression detection in both clinical and real-world settings.