Multimodal Transformer Research Articles

Visual commonsense reasoning (VCR) is a challenging reasoning task that requires a model not only to answer a question about a given image but also to provide a rationale justifying its choice. Graph-based networks are well suited to representing and extracting the correlations between image and language for reasoning, but how to construct and learn graphs from such multi-modal Euclidean data remains a fundamental problem. Most existing graph-based methods treat visual regions and linguistic words as identical graph nodes, ignoring the inherent characteristics of multi-modal data. In addition, these approaches typically have only one graph-learning layer, and their performance declines as the model goes deeper. To address these issues, a novel method named Multi-modal Structure-embedding Graph Transformer (MSGT) is proposed. Specifically, an answer-vision graph and an answer-question graph are constructed to represent and model intra-modal and inter-modal correlations in VCR simultaneously. Additional multi-modal structure representations are initialized and embedded according to visual region distances and linguistic word orders to give a more reasonable graph representation. Then, a structure-injecting graph transformer is designed to inject the embedded structure priors into the semantic correlation matrix, driving the evolution of both node features and structure representations. This design allows more layers to be stacked, making the model deeper and extracting more powerful features with instructive priors. To adaptively fuse graph features, a scored pooling mechanism is further developed to select valuable clues for reasoning from the learnt node features. Experiments demonstrate the superiority of the proposed MSGT framework over state-of-the-art methods on the VCR benchmark dataset. The source code of this work can be found at https://mic.tongji.edu.cn.
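The abstract above does not give formulas, so the following is only a minimal sketch of the general idea, not the MSGT implementation. It assumes the structure prior enters as an additive bias on the scaled dot-product attention scores, and that scored pooling is a softmax-weighted sum over nodes. The function names, the negative-distance bias, and all shapes are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def structure_injected_attention(nodes, structure_bias, w_q, w_k, w_v):
    """One attention step with a structure prior added to the scores.

    nodes:          (n, d) node features
    structure_bias: (n, n) prior on pairwise structure (assumed additive)
    """
    q, k, v = nodes @ w_q, nodes @ w_k, nodes @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # semantic correlation matrix
    attn = softmax(scores + structure_bias, axis=-1)  # inject the prior
    return attn @ v, attn

def scored_pooling(nodes, w_score):
    # Assumed form: score each node, normalize over nodes, take a weighted sum.
    weights = softmax(nodes @ w_score, axis=0)     # (n,)
    return weights @ nodes                         # (d,)

rng = np.random.default_rng(0)
n, d = 5, 8
nodes = rng.normal(size=(n, d))

# Hypothetical structure prior: closer visual regions get a larger bias.
positions = rng.uniform(size=(n, 2))
dist = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
structure_bias = -dist

w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
w_score = rng.normal(size=d)

updated, attn = structure_injected_attention(nodes, structure_bias, w_q, w_k, w_v)
pooled = scored_pooling(updated, w_score)
```

In this sketch each row of `attn` is a distribution over nodes, and `pooled` is a single feature vector summarizing the graph. A word-order prior for linguistic nodes could be built the same way from index differences instead of spatial distances.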

Diagnosing malignant skin tumors accurately at an early stage can be challenging because different categories of skin tumors display ambiguous and even confusing visual characteristics. To improve diagnostic precision, all available clinical data from multiple sources, particularly clinical images, dermoscopy images, and medical history, could be considered. In line with clinical practice, we propose a novel Transformer model, named RemixFormer++, that consists of a clinical image branch, a dermoscopy image branch, and a metadata branch. Given the unique characteristics of clinical and dermoscopy images, specialized attention strategies are adopted for each type. Clinical images are processed through a top-down architecture that captures both localized lesion details and global contextual information. Conversely, dermoscopy images undergo bottom-up processing with two-level hierarchical encoders designed to pinpoint fine-grained structural and textural features. A dedicated metadata branch integrates non-visual information by encoding relevant patient data. Fusing features from the three branches substantially boosts disease classification accuracy. RemixFormer++ demonstrates exceptional performance on four single-modality datasets (PAD-UFES-20, ISIC 2017/2018/2019). On the public multi-modal Derm7pt dataset, we achieved an absolute increase of 5.3% in averaged F1 and 1.2% in accuracy over the previous best method for the classification of five skin tumors. Furthermore, on a large-scale in-house dataset of 10,351 patients with the twelve most common skin tumors, our method obtained an overall classification accuracy of 92.6%. These promising results, on par with or better than the performance of 191 dermatologists in a comprehensive reader study, indicate the potential clinical usability of our method.
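The abstract states that features from the three branches are fused but does not say how. As a rough illustration only, the sketch below assumes the simplest option: concatenating per-branch feature vectors and applying a linear classifier. The feature sizes, the function name, and the random inputs standing in for branch outputs are all hypothetical.

```python
import numpy as np

def fuse_and_classify(clinical_feat, dermoscopy_feat, metadata_feat, w, b):
    """Late fusion by concatenation (an assumption, not the paper's method)."""
    fused = np.concatenate([clinical_feat, dermoscopy_feat, metadata_feat], axis=-1)
    logits = fused @ w + b
    # Stable softmax over classes.
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
batch, n_classes = 4, 5

# Stand-ins for the outputs of the three branches.
clinical = rng.normal(size=(batch, 16))
dermoscopy = rng.normal(size=(batch, 16))
metadata = rng.normal(size=(batch, 4))

w = rng.normal(size=(16 + 16 + 4, n_classes)) * 0.1
b = np.zeros(n_classes)

probs = fuse_and_classify(clinical, dermoscopy, metadata, w, b)
```

Each row of `probs` is a distribution over the tumor classes for one patient. A real system would learn `w` and `b` jointly with the branch encoders rather than sampling them.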

Articles published on Multimodal Transformer

Video–text retrieval via multi-modal masked transformer and adaptive attribute-aware graph convolutional network

Integrating GIN-based multimodal feature transformation and multi-feature combination voting for irony-aware cyberbullying detection

Accurate estimation of biological age and its application in disease prediction using a multimodal image Transformer system.

Multimodal Transformer of Incomplete MRI Data for Brain Tumor Segmentation.

Positive Unlabeled Fake News Detection via Multi-Modal Masked Transformer Network

Multi-Modal Structure-Embedding Graph Transformer for Visual Commonsense Reasoning

Strategies for Multimodal Image Data Transformation to a Common Format for Cloud Integration and Visualization

CAM-Vtrans: real-time sports training utilizing multi-modal robot data.

Local Climate Zone Classification via Semi-Supervised Multimodal Multiscale Transformer

Integrating Neural-Symbolic Reasoning With Variational Causal Inference Network for Explanatory Visual Question Answering.

MBGT: Encoding Brain Signals With Multimodal Brain Graph Transformer

Learning to Answer Visual Questions from Web Videos.

RemixFormer++: A Multi-modal Transformer Model for Precision Skin Tumor Differential Diagnosis with Memory-efficient Attention.

An Improved ConvNeXt with Multimodal Transformer for Physiological Signal Classification

MMSFormer: Multimodal Transformer for Material and Semantic Segmentation

Exploring Multi-modal Spatial-Temporal Contexts for High-performance RGB-T Tracking.

Efficient Multimodal Transformer With Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis

Multimodal transformer with adaptive modality weighting for multimodal sentiment analysis

Network security situation assessment and prediction method based on multimodal transformation in edge computing

Dual-adaptive interactive transformer with textual and visual context for image captioning
