Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding.

TL;DR

Dynamic MDETR decouples visual grounding into encoding and decoding phases and adds a dynamic decoder that exploits the spatial sparsity of images. Using only 9% of the feature points in the decoder, it cuts the multimodal transformer's GFLOPs by about 44% while still exceeding the accuracy of the encoder-only counterpart, and a CLIP-based variant reaches state-of-the-art results on the evaluated benchmarks.

Abstract

Multimodal transformers exhibit high capacity and flexibility in aligning image and text for visual grounding. However, the existing encoder-only grounding framework (e.g., TransVG) suffers from heavy computation due to the self-attention operation with quadratic time complexity. To address this issue, we present a new multimodal transformer architecture, coined Dynamic Multimodal Detection Transformer (Dynamic MDETR), which decouples the whole grounding process into encoding and decoding phases. The key observation is that there exists high spatial redundancy in images. Thus, we devise a new dynamic multimodal transformer decoder that exploits this sparsity prior in order to speed up the visual grounding process. Specifically, our dynamic decoder is composed of a 2D adaptive sampling module and a text-guided decoding module. The sampling module aims to select the informative patches by predicting offsets with respect to a reference point, while the decoding module extracts the grounded object information by performing cross-attention between image features and text features. These two modules are stacked alternately to gradually bridge the modality gap and iteratively refine the reference point of the grounded object, eventually realizing the objective of visual grounding. Extensive experiments on five benchmarks demonstrate that our proposed Dynamic MDETR achieves competitive trade-offs between computation and accuracy. Notably, using only 9% of the feature points in the decoder, we can reduce the GFLOPs of the multimodal transformer by ∼44% and still obtain higher accuracy than the encoder-only counterpart. With the same number of encoder layers as TransVG, our Dynamic MDETR (ResNet-50) outperforms TransVG (ResNet-101) while adding only marginal computational cost relative to TransVG.
In addition, to verify its generalization ability and scale up our Dynamic MDETR, we build the first one-stage CLIP empowered visual grounding framework, and achieve the state-of-the-art performance on these benchmarks.
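
The decoder described in the abstract alternates two steps: sample a small set of image points around a reference point, then decode with cross-attention over those points and the text. The NumPy sketch below illustrates one possible reading of that loop and is not the authors' implementation. The function names, the weight matrices `W_off` and `W_ref`, the concatenation of sampled visual tokens with text tokens, and the `tanh` update of the reference point are all assumptions made for illustration. The sizes (36 sampled points on a 20×20 feature map) are chosen only to mirror the 9% ratio quoted in the abstract.

```python
import numpy as np


def bilinear_sample(feat, pts):
    """Sample a feature map feat (H, W, C) at continuous (x, y) pixel coordinates."""
    H, W, _ = feat.shape
    x = np.clip(pts[:, 0], 0, W - 1)
    y = np.clip(pts[:, 1], 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention: query (Q, C), keys/values (N, C)."""
    scores = query @ keys.T / np.sqrt(query.shape[-1])
    return softmax(scores) @ values


def decoder_layer(feat, text, query, ref, W_off, W_ref):
    """One hypothetical decoder layer: sample, decode, refine the reference point."""
    n_points = W_off.shape[1] // 2
    # 1) 2D adaptive sampling: predict offsets relative to the reference point.
    offsets = (query @ W_off).reshape(n_points, 2)
    pts = ref[None, :] + offsets
    sampled = bilinear_sample(feat, pts)              # (n_points, C)
    # 2) Text-guided decoding (assumed form): the query attends over the
    #    sampled visual tokens together with the text tokens.
    tokens = np.concatenate([sampled, text], axis=0)
    query = query + cross_attention(query[None, :], tokens, tokens)[0]
    # 3) Refine the reference point from the updated query (assumed update rule).
    ref = ref + np.tanh(query @ W_ref)
    return query, ref, pts


# Toy run with random weights: 36 sampled points out of 20 * 20 = 400 positions.
rng = np.random.default_rng(0)
H, W, C, n_points, n_text = 20, 20, 8, 36, 5
feat = rng.normal(size=(H, W, C))                     # stand-in for encoder image features
text = rng.normal(size=(n_text, C))                   # stand-in for encoder text features
query = rng.normal(size=C)
ref = np.array([W / 2.0, H / 2.0])                    # start at the image centre
for _ in range(3):                                    # stacked decoder layers
    W_off = rng.normal(size=(C, 2 * n_points))
    W_ref = rng.normal(size=(C, 2)) * 0.1
    query, ref, pts = decoder_layer(feat, text, query, ref, W_off, W_ref)
print("sampled fraction:", n_points / (H * W))
print("final reference point:", ref)
```

In this sketch the attention in each layer runs over 36 + 5 tokens rather than all 400 image positions, which is the kind of saving the sparsity argument relies on. The actual GFLOPs figures in the abstract come from the paper's experiments, not from this toy example.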

Similar Papers
  • Conference Article
  • Cited by 20
  • 10.18653/v1/2021.findings-acl.38
Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation
  • Jan 1, 2021
  • Feilong Chen + 4 more

Visual dialogue is a challenging task since it needs to answer a series of coherent questions on the basis of understanding the visual environment. Previous studies focus on the implicit exploration of multimodal co-reference by implicitly attending to spatial image features or object-level image features but neglect the importance of locating the objects explicitly in the visual content, which is associated with entities in the textual content. Therefore, in this paper we propose a Multimodal Incremental Transformer with Visual Grounding, named MITVG, which consists of two key parts: visual grounding and multimodal incremental transformer. Visual grounding aims to explicitly locate related objects in the image guided by textual entities, which helps the model exclude the visual content that does not need attention. On the basis of visual grounding, the multimodal incremental transformer encodes the multi-turn dialogue history combined with visual scene step by step according to the order of the dialogue and then generates a contextually and visually coherent response. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate the superiority of the proposed model, which achieves comparable performance.

  • Research Article
  • Cited by 1
  • 10.3390/app142210157
Multi-Modal Vision Transformer with Explainable Shapley Additive Explanations Value Embedding for Cymbidium goeringii Quality Grading
  • Nov 6, 2024
  • Applied Sciences
  • Zhen Wang + 3 more

Cymbidium goeringii (Rchb. f.) is a traditional Chinese flower with highly valued biological, cultural, and artistic properties. However, the valuation of Rchb. f. mainly relies on subjective judgment and lacks standardized digital evaluation and grading methods. Traditional grading methods rely solely on unimodal data and are based on fuzzy grading standards; in particular, the key features that determine value are poorly explained. Accurately evaluating Rchb. f. quality through multi-modal algorithms and clarifying the impact mechanism of key features on Rchb. f. value is essential for providing scientific references for online orchid trading. A multi-modal Transformer for Rchb. f. quality grading combined with the Shapley Additive Explanations (SHAP) algorithm was proposed, which mainly includes one embedding layer, one UNet, one Vision Transformer (ViT) and one Encoder layer. A multi-modal orchid dataset including images and text was obtained from Orchid Trading Website, and seven key features were extracted. Based on the petals' RGB values segmented by UNet and the global fine-grained features extracted by ViT, text features and image features were fused in the Transformer encoders through a concatenation operation, achieving an accuracy of 93.13%. Furthermore, the SHAP algorithm was utilized to quantify and rank the importance of the seven features, clarifying the impact mechanism of key features on Rchb. f. quality and value. This multi-modal Transformer with the SHAP algorithm for Rchb. f. grading provided a novel idea for representing the explainable features accurately, exhibiting good potential for establishing a reliable digital evaluation method for agricultural products with high value.

  • Research Article
  • Cited by 4
  • 10.1016/j.neucom.2024.128621
Zero-shot visual grounding via coarse-to-fine representation learning
  • Sep 16, 2024
  • Neurocomputing
  • Jinpeng Mi + 5 more

  • Conference Article
  • Cited by 26
  • 10.1109/cvpr52688.2022.01509
Multi-Modal Dynamic Graph Transformer for Visual Grounding
  • Jun 1, 2022
  • Sijia Chen + 1 more

Visual grounding (VG) aims to align the correct regions of an image with a natural language query about that image. We found that existing VG methods are trapped by the single-stage grounding process that performs a sole evaluate-and-rank for meticulously prepared regions. Their performance depends on the density and quality of the candidate regions, and is capped by the inability to optimize the located regions continuously. To address these issues, we propose to remodel VG into a progressively optimized visual semantic alignment process. Our proposed multi-modal dynamic graph transformer (M-DGT) achieves this by building upon the dynamic graph structure with regions as nodes and their semantic relations as edges. Starting from a few randomly initialized regions, M-DGT is able to make sustainable adjustments (i.e., 2D spatial transformation and deletion) to the nodes and edges of the graph based on multi-modal information and the graph feature, thereby efficiently shrinking the graph to approach the ground truth regions. Experiments show that with an average of 48 boxes as initialization, the performance of M-DGT on the Flickr30k Entities and RefCOCO datasets outperforms existing state-of-the-art methods by a substantial margin, in terms of both accuracy and Intersect over Union (IOU) scores. Furthermore, introducing M-DGT to optimize the predicted regions of existing methods can further significantly improve their performance. The source codes are available at https://github.com/iQua/M-DGT.

  • Research Article
  • Cited by 5
  • 10.1109/tmm.2025.3535345
Phrase Decoupling Cross-Modal Hierarchical Matching and Progressive Position Correction for Visual Grounding
  • Jan 1, 2025
  • IEEE Transactions on Multimedia
  • Minghong Xie + 5 more

Visual grounding has attracted wide attention thanks to its broad application in various visual language tasks. Although visual grounding has made significant research progress, existing methods ignore the promotion effect of the association between text and image features at different hierarchies on cross-modal matching. This paper proposes a Phrase Decoupling Cross-Modal Hierarchical Matching and Progressive Position Correction Visual Grounding method. It first generates a mask through decoupled sentence phrases, and a text and image hierarchical matching mechanism is constructed, highlighting the role of association between different hierarchies in cross-modal matching. In addition, a corresponding target object position progressive correction strategy is defined based on the hierarchical matching mechanism to achieve accurate positioning for the target object described in the text. This method can continuously optimize and adjust the bounding box position of the target object as the certainty of the text description of the target object improves. This design explores the association between features at different hierarchies and highlights the role of features related to the target object and its position in target positioning. The proposed method is validated on different datasets through experiments, and its superiority is verified by the performance comparison with the state-of-the-art methods.

  • Research Article
  • 10.1109/tmm.2025.3608295
Hierarchical Multi-Modal Transformer for Cross-Modal Long Document Classification
  • Jan 1, 2025
  • IEEE Transactions on Multimedia
  • Tengfei Liu + 4 more

Long Document Classification (LDC) has gained significant attention recently. However, multi-modal data in long documents such as texts and images are not being effectively utilized. Prior studies in this area have attempted to integrate texts and images in document-related tasks, but they have only focused on short text sequences and images of pages. How to classify long documents with hierarchical structure texts and embedding images is a new problem and faces multi-modal representation difficulties. In this paper, we propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification. The HMT conducts multi-modal feature interaction and fusion between images and texts in a hierarchical manner. Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships between image features, and the section and sentence features. Furthermore, we introduce a new interaction strategy called the dynamic mask transfer module to integrate these two transformers by propagating features between them. To validate our approach, we conduct cross-modal LDC experiments on two newly created and two publicly available multi-modal long document datasets, and the results show that the proposed HMT outperforms state-of-the-art single-modality and multi-modality methods.

  • Research Article
  • Cited by 2
  • 10.3390/app13095649
Semantic-Aligned Cross-Modal Visual Grounding Network with Transformers
  • May 4, 2023
  • Applied Sciences
  • Qianjun Zhang + 1 more

Multi-modal deep learning methods have achieved great improvements in visual grounding; their objective is to localize text-specified objects in images. Most of the existing methods can localize and classify objects with significant appearance differences but suffer from the misclassification problem for extremely similar objects, due to inadequate exploration of multi-modal features. To address this problem, we propose a novel semantic-aligned cross-modal visual grounding network with transformers (SAC-VGNet). SAC-VGNet integrates visual and textual features with semantic alignment to highlight important feature cues for capturing tiny differences between similar objects. Technically, SAC-VGNet incorporates a multi-modal fusion module to effectively fuse visual and textual descriptions. It also introduces contrastive learning to align linguistic and visual features on the text-to-pixel level, enabling the capture of subtle differences between objects. The overall architecture is end-to-end without the need for extra parameter settings. To evaluate our approach, we manually annotate text descriptions for images in two fine-grained visual grounding datasets. The experimental results demonstrate that SAC-VGNet significantly improves performance in fine-grained visual grounding.

  • Research Article
  • 10.3390/literature5020013
Judging Books by Their Covers: The Impact of Text and Image Features on the Aesthetic Evaluation and Memorability of Italian Novels
  • Jun 7, 2025
  • Literature
  • Kirren Chana + 6 more

Book covers are often the first component seen before a reader engages with a book’s contents; therefore, careful consideration is given to the text and image features that constitute their design. This study investigates the effects of the presentation of verbal (text) and visual (image) features on memorability and aesthetic evaluation in the context of book covers. To this aim, 50 participants took part in a memory recognition task in which the same book cover information was encoded in a learning phase, and either text or image features from the book covers acted as an informational cue for memory recognition and aesthetic evaluations. Our results revealed that image features significantly aided memory performance more than text features. Image features that were rated more beautiful were not better recognized as a result. However, differences in memory performance were found in relation to familiarity and, in a non-linear fashion, the extent to which the book’s contents could be inferred from the image’s informational content. Additionally, reading behavior was not found to influence memory performance. These results are discussed with regard to the interplay of text and image informational cues on book cover perception and provide implications for future studies.

  • Research Article
  • 10.1080/17538947.2025.2512059
Transferring CLIP for visual grounding in remote sensing images
  • Jun 17, 2025
  • International Journal of Digital Earth
  • Linlin Liang + 4 more

Remote Sensing Visual Grounding (RSVG) task aims to localize specific objects in remote sensing (RS) images based on natural language queries and holds considerable potential for various applications. Existing approaches primarily rely on unimodal pre-trained encoders, leading to insufficient cross-modal information alignment. Moreover, these methods incur high computational costs for full fine-tuning of both visual and language encoders to achieve modality alignment, thereby constraining localization performance. Therefore, we propose a CLIP (Contrastive Language-Image Pretraining)-based remote sensing visual grounding framework, RSCLIPVG. RSCLIPVG employs a frozen CLIP model to extract visual and textual features, and we introduce a lightweight visual adapter to adapt visual representations, efficiently transferring the rich multimodal knowledge of CLIP to the RSVG scenario. Furthermore, a Multi-Level Collaborative Cross-modal Enhancement (MLCCME) module is developed to refine and integrate multi-level visual and textual features, enabling comprehensive cross-modal interaction and alignment. This effectively enhances the feature representation of target objects, thereby mitigating issues such as scale variations and cluttered backgrounds in remote sensing imagery. Experimental results on the DIOR-RSVG dataset indicate that our approach significantly outperforms previous methods. These research findings demonstrate the potential of CLIP in RSVG tasks, offering new solutions and perspectives for the RSVG field.

  • Conference Article
  • Cited by 6
  • 10.1109/cas47993.2019.9075725
Spam Email Image Classification Based on Text and Image Features
  • Dec 1, 2019
  • Estqlal Hammad Dhah + 2 more

Filtering of image-based spam email remains a major challenge for researchers. This paper presents a method based on several observations: spam images contain a large percentage of text, whose characteristics or features differ from those of other types of images, and the features of these images are very similar to one another. These observations can be used to distinguish spam images with text regions from other images. A hybrid method based on a combined feature vector built from text regions and features of the whole image is proposed. Two types of features are extracted. The first feature extraction method uses the local binary pattern (LBP) to extract the image texture features directly, while the second extracts features of the image text regions only. The extracted features are used both individually and in combination to train classifiers at the training stage. A one-class KNN classifier and a two-class KNN classifier are applied separately. Each classifier was used in three ways: with the text-region features, with the image texture features, and with both sets of features merged. Experimental results showed that using image and text features together improves the effectiveness of the classification compared with using only image or only text features.

  • Conference Article
  • Cited by 89
  • 10.1109/cvpr52688.2022.01506
Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding
  • Jun 1, 2022
  • Jiabo Ye + 7 more

Visual grounding focuses on establishing fine-grained alignment between vision and natural language, which has essential applications in multimodal reasoning systems. Existing methods use pre-trained query-agnostic visual backbones to extract visual feature maps independently without considering the query information. We argue that the visual features extracted from the visual backbones and the features really needed for multimodal reasoning are inconsistent. One reason is that there are differences between pre-training tasks and visual grounding. Moreover, since the backbones are query-agnostic, it is difficult to completely avoid the inconsistency issue by training the visual backbone end-to-end in the visual grounding framework. In this paper, we propose a Query-modulated Refinement Network (QRNet) to address the inconsistent issue by adjusting intermediate features in the visual backbone with a novel Query-aware Dynamic Attention (QD-ATT) mechanism and query-aware multiscale fusion. The QD-ATT can dynamically compute query-dependent visual attention at the spatial and channel levels of the feature maps produced by the visual backbone. We apply the QRNet to an end-to-end visual grounding framework. Extensive experiments show that the proposed method outperforms state-of-the-art methods on five widely used datasets. Our code is available at https://github.com/LukeForeverYoung/QRNet.

  • Research Article
  • 10.1109/jsen.2026.3665551
Adaptive noise reduction pipeline leakage detection method based on Multimodal Transformer and contrastive learning
  • Jan 1, 2026
  • IEEE Sensors Journal
  • Xianming Lang + 3 more

Under the dual constraints of strong industrial noise and small sample size, achieving high precision and reliability in pipeline leakage detection is a significant challenge. The root cause lies in the fact that traditional single-modal methods have a single perception dimension and insufficient anti-interference ability, making it difficult to effectively distinguish strong background noise with overlapping spectra, resulting in reduced measurement accuracy and reliability. To address this issue, this paper proposes an adaptive noise reduction detection method based on multi-modal Transformer and contrastive learning (MDT-CL). This paper uses pressure, vibration, and acoustic emission sensors to capture complementary signals, and synchronizes the collection of these signals through dynamic time warping technology. The proposed method comprises three key innovations. Firstly, we propose a multi-scale Transformer encoder based on physical principles, employing an asymmetric cross-modal attention mechanism to focus on pressure signals that are sensitive to leakage. We integrate vibration and acoustic features and fuse them at the feature level, fundamentally solving the heterogeneity and spatio-temporal mismatch problems of multi-source asynchronous sensor data. Secondly, we propose an embedded adaptive spectral normalization module, which is a dynamic frequency-domain filtering technology that collaboratively optimizes with the network. It intelligently suppresses noise-sensitive frequency bands and synchronously enhances leakage features, effectively overcoming the limitations of traditional static filtering. Thirdly, we propose a contrastive learning pre-training strategy for pipeline noise. By constructing a multi-modal noise enhancement library and feature decoupling mechanism, it significantly expands the inter-class distance between leakage and normal operating condition noise, enhancing the model’s generalization ability in complex noise conditions. 
Experimental results demonstrate that the proposed method achieves a leakage detection accuracy of 96.8%, representing a 1.6% improvement over the current state-of-the-art method.

  • Research Article
  • 10.1109/jstars.2025.3575770
Enhanced Grounding DINO: Efficient Cross-Modality Block for Open-Set Object Detection in Remote Sensing
  • Jan 1, 2025
  • IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
  • Zibo Hu + 6 more

Open-set object detection unifies candidate category object detection and remote sensing visual grounding, and can simultaneously meet candidate category multi-object detection and text-guided object detection. Most existing open-set detectors are developed based on candidate category detectors by introducing text information. These methods need to process text and images at the same time, which will increase their training overhead and computational complexity. The open-set detector consists of a backbone, neck, and prediction head, with the neck being the main source of computational complexity due to multi-scale self-attention and cross-modal attention. However, little research has focused on improving their computational efficiency while maintaining model performance. This paper addresses this gap by proposing an Enhanced Grounding DINO to optimize the neck network, reducing computational complexity while preserving model performance. Specifically, the key contributions are the proposed efficient cross-modality block, which consists of the Multi-Scale Visual-Cross-Text Fusion Module (MSVCTFM) and Inverse Pyramid Feature Refinement (IPFR). The efficient cross-modality block reduces the computational complexity of both multi-scale visual feature refinement and the fusion of text and visual features, while maintaining model performance. The MSVCTFM decouples and optimizes the fusion of multi-scale visual and text features, thereby enhancing model performance. The IPFR further reduces the computational complexity involved in refining multi-scale visual features. The method achieves a 49.7% reduction in GFLOPs, improves performance on visual grounding datasets DIOR-RSVG and RSVG-HR, and delivers competitive results on the candidate category dataset DOTA.

  • Video Transcripts
  • 10.48448/ptvs-ar18
Flexible Visual Grounding
  • May 11, 2022
  • Underline Science Inc.
  • Sadao Kurohashi + 2 more

Existing visual grounding datasets are artificially made, where every query regarding an entity must be able to be grounded to a corresponding image region, i.e., answerable. However, in real-world multimedia data such as news articles and social media, many entities in the text cannot be grounded to the image, i.e., they are unanswerable, because the text does not necessarily describe the accompanying image directly. A robust visual grounding model should be able to flexibly deal with both answerable and unanswerable visual grounding. To study this flexible visual grounding problem, we construct a pseudo dataset and a social media dataset including both answerable and unanswerable queries. In order to handle unanswerable visual grounding, we propose a novel method that adds a pseudo image region corresponding to a query that cannot be grounded. The model is then trained to ground to ground-truth regions for answerable queries and pseudo regions for unanswerable queries. In our experiments, we show that our model can flexibly process both answerable and unanswerable queries with high accuracy on our datasets.

  • Conference Article
  • Cited by 332
  • 10.1109/iccv48922.2021.00179
TransVG: End-to-End Visual Grounding with Transformers
  • Oct 1, 2021
  • Jiajun Deng + 4 more

In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region onto an image. The state-of-the-art methods, including two-stage or one-stage ones, rely on a complex module with manually-designed mechanisms to perform the query reasoning and multi-modal fusion. However, the involvement of certain mechanisms in fusion module design, such as query decomposition and image scene graph, makes the models easily overfit to datasets with specific scenarios, and limits the plenitudinous interaction between the visual-linguistic context. To avoid this caveat, we propose to establish the multi-modal correspondence by leveraging transformers, and empirically show that the complex fusion modules (e.g., modular attention network, dynamic graph, and multi-modal tree) can be replaced by a simple stack of transformer encoder layers with higher performance. Moreover, we re-formulate the visual grounding as a direct coordinates regression problem and avoid making predictions out of a set of candidates (i.e., region proposals or anchor boxes). Extensive experiments are conducted on five widely used datasets, and a series of state-of-the-art records are set by our TransVG. We build the benchmark of transformer-based visual grounding framework and make the code available at https://github.com/djiajunustc/TransVG.
