Referring Image Segmentation with Two-Stage Multi-Modal Interaction
The objective of referring image segmentation is to extract the referred entities from an image according to a given natural language sentence. The core of this task is the interaction between textual and visual features to build multi-modal relationships. Prior state-of-the-art methods mainly focus on either local multi-level intermediate feature interaction or global text-to-image alignment, which can result in insufficient interaction for capturing global multi-modal information exchange or fine-grained details of the referred objects, respectively. To overcome this issue, we introduce a referring image segmentation framework with two-stage multi-modal interaction. Specifically, we devise a novel multi-level cross-modal fusion module that effectively facilitates the interaction of intermediate linguistic and visual features to capture fine-grained details of the referred objects. In addition, we further align the linguistic and visual information through an elaborate global alignment module that accurately localizes the entire referred objects. Comprehensive experiments on three referring image segmentation datasets show that the proposed two-stage multi-modal interaction framework exhibits a marked superiority over contemporary state-of-the-art approaches.
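The two stages described above can be pictured with a minimal PyTorch sketch: stage one fuses intermediate visual features with word features via cross-attention at several levels, and stage two aligns a global sentence embedding with the fused visual map. Module and tensor names here are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of a two-stage multi-modal interaction (hypothetical modules, PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Stage 1: fuse one level of intermediate visual features with word features."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis, words):
        # vis: (B, HW, C) flattened visual features, words: (B, T, C)
        fused, _ = self.attn(query=vis, key=words, value=words)
        return self.norm(vis + fused)

class GlobalAlignment(nn.Module):
    """Stage 2: align a global sentence embedding with every pixel."""
    def forward(self, vis, sent):
        # vis: (B, HW, C), sent: (B, C)
        vis = F.normalize(vis, dim=-1)
        sent = F.normalize(sent, dim=-1)
        score = torch.einsum("bnc,bc->bn", vis, sent)     # cosine similarity per pixel
        return vis * score.unsqueeze(-1), score            # re-weighted features + coarse localization

B, HW, T, C = 2, 196, 12, 256
vis_levels = [torch.randn(B, HW, C) for _ in range(3)]     # multi-level visual features
words, sent = torch.randn(B, T, C), torch.randn(B, C)

fusion = CrossModalFusion(C)
fused = [fusion(v, words) for v in vis_levels]              # stage 1: per-level interaction
aligned, coarse = GlobalAlignment()(sum(fused) / 3, sent)   # stage 2: global localization
print(aligned.shape, coarse.shape)  # torch.Size([2, 196, 256]) torch.Size([2, 196])
```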
- Research Article
51
- 10.1109/tip.2024.3371348
- Jan 1, 2024
- IEEE Transactions on Image Processing
Referring Image Segmentation (RIS) is a fundamental vision-language task that outputs object masks based on text descriptions. Many works have achieved considerable progress on RIS, including different fusion method designs. In this work, we explore an essential question: "What if the text description is wrong or misleading?" For example, the described objects are not in the image. We term such a sentence a negative sentence. However, existing RIS solutions cannot handle such a setting. To this end, we propose a new formulation of RIS, named Robust Referring Image Segmentation (R-RIS), which considers negative sentence inputs in addition to the regular positive text inputs. To facilitate this new task, we create three R-RIS datasets by augmenting existing RIS datasets with negative sentences and propose new metrics to evaluate both types of inputs in a unified manner. Furthermore, we propose a new transformer-based model, called RefSegformer, with a token-based vision and language fusion module. Our design extends easily to the R-RIS setting by adding extra blank tokens. The proposed RefSegformer achieves state-of-the-art results on both RIS and R-RIS datasets, establishing a solid baseline for both settings. Our project page is at https://github.com/jianzongwu/robust-ref-seg.
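One way to picture the extra blank tokens is the following sketch (hypothetical names, not the released RefSegformer code): learnable blank tokens are appended to the language tokens before fusion, and attention mass landing on them can signal that the sentence refers to nothing in the image.

```python
# Sketch of vision-language fusion with learnable "blank" tokens for negative sentences.
import torch
import torch.nn as nn

class FusionWithBlankTokens(nn.Module):
    def __init__(self, dim, num_blank=4, num_heads=8):
        super().__init__()
        self.blank = nn.Parameter(torch.randn(1, num_blank, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis, words):
        # vis: (B, HW, C) visual tokens, words: (B, T, C) language tokens
        B = vis.size(0)
        keys = torch.cat([words, self.blank.expand(B, -1, -1)], dim=1)  # (B, T + num_blank, C)
        fused, attn_w = self.attn(vis, keys, keys)
        # Average attention that visual tokens assign to the blank tokens:
        # a high value suggests no referent is present in the image.
        blank_mass = attn_w[:, :, -self.blank.size(1):].mean(dim=(1, 2))
        return fused, blank_mass

vis, words = torch.randn(2, 196, 256), torch.randn(2, 10, 256)
fused, blank_mass = FusionWithBlankTokens(256)(vis, words)
print(fused.shape, blank_mass.shape)  # torch.Size([2, 196, 256]) torch.Size([2])
```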
- Research Article
4
- 10.1145/3701733
- Dec 21, 2024
- ACM Transactions on Multimedia Computing, Communications, and Applications
Referring Image Segmentation (RIS) has been extensively studied over the past decade, leading to the development of advanced algorithms. However, there has been a lack of research investigating how existing algorithms should be benchmarked with complex language queries, which include more informative descriptions of surrounding objects and backgrounds (e.g., "the black car" vs. "the black car is parking on the road and beside the bus"). Given the significant improvement in the semantic understanding capability of large pre-trained models, it is crucial to take a step further in RIS by incorporating complex language that resembles real-world applications. To close this gap, building upon the existing RefCOCO and Visual Genome datasets, we propose a new RIS benchmark with complex queries, namely RIS-CQ. The RIS-CQ dataset is of high quality and large scale; it challenges existing RIS methods with enriched, specific, and informative queries, and enables a more realistic scenario for RIS research. Besides, we present a niche-targeting method, the Dual-Modality Graph Alignment (DuMoGa) model, to better tackle RIS-CQ, which outperforms a series of RIS methods. To provide a valuable foundation for future advancements in the field of RIS with complex queries, we release the datasets, pre-processing and synthetic scripts, and the algorithm implementations at https://github.com/lili0415/DuMoGa.
- Conference Article
1
- 10.1109/gcce53005.2021.9622092
- Oct 12, 2021
We propose a novel text-guided image manipulation method using referring image segmentation, which segments the region relevant to a given text description. By introducing referring image segmentation into an image manipulation method, our approach enables manipulation in the desired region and reconstruction in the regions irrelevant to the text description. Experimental results show that our text-guided image manipulation method improves accuracy compared with existing image manipulation methods.
- Conference Article
10
- 10.1109/iros51168.2021.9636172
- Sep 27, 2021
Humans have a natural ability to effortlessly comprehend linguistic commands such as "park next to the yellow sedan" and instinctively know which region of the road the vehicle should navigate. Extending this ability to autonomous vehicles is the next step towards creating fully autonomous agents that respond and act according to human commands. To this end, we propose the novel task of Referring Navigable Regions (RNR), i.e., grounding regions of interest for navigation based on a linguistic command. RNR differs from Referring Image Segmentation (RIS), which grounds an object referred to by the natural language expression rather than a navigable region. For example, for the command "park next to the yellow sedan," RIS aims to segment the referred sedan, whereas RNR aims to segment the suggested parking region on the road. We introduce a new dataset, Talk2Car-RegSeg, which extends the existing Talk2Car dataset with segmentation masks for the regions described by the linguistic commands. A separate test split with concise manoeuvre-oriented commands is provided to assess the practicality of our dataset. We benchmark the proposed dataset using a novel transformer-based architecture, present extensive ablations, and show superior performance over baselines on multiple evaluation metrics. A downstream path planner generating trajectories based on RNR outputs confirms the efficacy of the proposed framework.
- Research Article
10
- 10.1142/s0129065724500643
- Sep 23, 2024
- International journal of neural systems
Referring image segmentation aims to accurately align image pixels and text features for object segmentation based on natural language descriptions. This paper proposes NSNPRIS (convolutional nonlinear spiking neural P systems for referring image segmentation), a novel model based on convolutional nonlinear spiking neural P systems. NSNPRIS features NSNPFusion and Language Gate modules to enhance feature interaction during encoding, along with an NSNPDecoder for feature alignment and decoding. Experimental results on RefCOCO, RefCOCO+, and G-Ref datasets demonstrate that NSNPRIS performs better than mainstream methods. Our contributions include advances in the alignment of pixel and textual features and the improvement of segmentation accuracy.
- Research Article
75
- 10.1109/tmm.2019.2942480
- Sep 26, 2019
- IEEE Transactions on Multimedia
A referring expression is a natural language expression used to refer to a particular object. In this paper, we focus on the problem of image segmentation from natural language referring expressions. Existing works tackle this problem by augmenting convolutional semantic segmentation networks with an LSTM sentence encoder, optimized by a pixel-wise classification loss. We argue that the distribution similarity between the prediction and the ground truth plays an important role in referring image segmentation, and therefore introduce a complementary loss that considers the consistency between the two distributions. To this end, we propose to train the referring image segmentation model in a generative adversarial fashion, which effectively addresses the distribution-similarity problem. In particular, the proposed adversarial semantic guidance network (ASGN) offers the following advantages: a) more detailed visual information is incorporated by the detail enhancement; b) semantic information counteracts the word-embedding impact; c) the proposed adversarial learning approach alleviates distribution inconsistencies. Experimental results on four standard datasets show significant improvements over all the compared baseline models, demonstrating the effectiveness of our method.
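A condensed sketch of such an adversarial training step is given below, assuming PyTorch. The segmenter and discriminator are generic stand-ins (the language branch is omitted for brevity), not the ASGN architecture: the segmenter plays the generator, while a discriminator judges predicted versus ground-truth masks conditioned on the image, pushing the predicted mask distribution toward the ground truth alongside the usual pixel-wise loss.

```python
# Hypothetical adversarial training step for referring segmentation (stand-in networks).
import torch
import torch.nn as nn
import torch.nn.functional as F

seg_net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 1))
disc = nn.Sequential(nn.Conv2d(4, 16, 3, stride=2, padding=1), nn.ReLU(),
                     nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))
opt_g = torch.optim.Adam(seg_net.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

image = torch.randn(2, 3, 64, 64)
gt_mask = (torch.rand(2, 1, 64, 64) > 0.5).float()

# Discriminator step: real = (image, ground-truth mask), fake = (image, predicted mask).
pred = torch.sigmoid(seg_net(image)).detach()
d_real = disc(torch.cat([image, gt_mask], dim=1))
d_fake = disc(torch.cat([image, pred], dim=1))
loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
         F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Segmenter step: pixel-wise loss plus an adversarial term encouraging distribution consistency.
logits = seg_net(image)
pred = torch.sigmoid(logits)
loss_pix = F.binary_cross_entropy_with_logits(logits, gt_mask)
loss_adv = F.binary_cross_entropy_with_logits(disc(torch.cat([image, pred], dim=1)),
                                              torch.ones_like(d_real))
loss_g = loss_pix + 0.1 * loss_adv
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```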
- Conference Article
6
- 10.1109/icassp43922.2022.9746970
- May 23, 2022
This paper proposes a novel generative adversarial network to improve the performance of image manipulation using natural language descriptions that contain desired attributes. Text-guided image manipulation aims to semantically manipulate an image in line with the text description while preserving text-irrelevant regions. To achieve this, we introduce referring image segmentation into a generative adversarial network for image manipulation. Referring image segmentation generates a segmentation mask that extracts the text-relevant region. By utilizing the feature map of this segmentation mask in the network, the proposed method explicitly distinguishes text-relevant and text-irrelevant regions and makes the following two contributions. First, our model can attend only to the text-relevant region and manipulate that region in line with the text description. Second, our model achieves an appropriate balance between generating accurate attributes in the text-relevant region and reconstructing the text-irrelevant regions. Experimental results show that the proposed method significantly improves the performance of image manipulation.
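The core idea of separating text-relevant and text-irrelevant regions can be reduced to mask-guided blending. The sketch below is a simplified, hypothetical formulation (the paper feeds the mask's feature map into the generator rather than blending output images directly):

```python
# Simplified sketch: a soft mask from referring segmentation gates which regions the
# generator may change; text-irrelevant regions are reconstructed from the input image.
import torch

def mask_guided_manipulation(image, manipulated, soft_mask):
    """image, manipulated: (B, 3, H, W); soft_mask: (B, 1, H, W) in [0, 1]."""
    # Text-relevant pixels come from the manipulated image, the rest are copied from the input.
    return soft_mask * manipulated + (1.0 - soft_mask) * image

image = torch.rand(1, 3, 128, 128)        # original image
manipulated = torch.rand(1, 3, 128, 128)  # generator output conditioned on the text
soft_mask = torch.rand(1, 1, 128, 128)    # referring segmentation of the text-relevant region
out = mask_guided_manipulation(image, manipulated, soft_mask)
print(out.shape)  # torch.Size([1, 3, 128, 128])
```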
- Research Article
4
- 10.1109/tnnls.2023.3281372
- Oct 1, 2024
- IEEE transactions on neural networks and learning systems
Recently, referring image segmentation has attracted wide attention given its huge potential in human-robot interaction. Networks that identify the referred region must have a deep understanding of both the image and the language semantics. To this end, existing works design various mechanisms for cross-modality fusion, such as tile-and-concatenate operations and vanilla non-local manipulation. However, such plain fusion is usually either too coarse or constrained by exorbitant computational overhead, ultimately yielding an insufficient understanding of the referent. In this work, we propose a fine-grained semantic funneling infusion (FSFI) mechanism to solve this problem. The FSFI imposes a constant spatial constraint on the querying entities from different encoding stages and dynamically infuses the gleaned language semantics into the vision branch. Moreover, it decomposes the features from different modalities into more delicate components, allowing the fusion to happen in multiple low-dimensional spaces. Such fusion is more effective than fusion in a single high-dimensional space, given its ability to sink more representative information along the channel dimension. Another problem haunting the task is that infusing highly abstract semantics blurs the details of the referent. To address this, we propose a multiscale attention-enhanced decoder (MAED). We design a detail enhancement operator (DeEh) and apply it in a multiscale and progressive way: features from the higher level generate attention guidance that directs lower-level features to attend more to detail regions. Extensive results on challenging benchmarks show that our network performs favorably against state-of-the-art methods (SOTAs).
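A detail-enhancement step of the kind described above, where higher-level features generate attention guidance for lower-level features, could look roughly like the following sketch (module name and shapes are illustrative assumptions, not the paper's implementation):

```python
# Rough sketch: a high-level (coarse, semantic) feature map produces a spatial attention map
# that re-weights a low-level (fine, detailed) feature map toward detail regions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailEnhance(nn.Module):
    def __init__(self, high_ch):
        super().__init__()
        self.to_attn = nn.Conv2d(high_ch, 1, kernel_size=1)  # squeeze high-level features to one channel

    def forward(self, high, low):
        # high: (B, Ch, H/2, W/2), low: (B, Cl, H, W)
        attn = torch.sigmoid(self.to_attn(high))                        # (B, 1, H/2, W/2)
        attn = F.interpolate(attn, size=low.shape[-2:], mode="bilinear",
                             align_corners=False)                       # upsample to low-level resolution
        return low * (1.0 + attn)                                       # emphasize detail regions, keep the rest

high = torch.randn(2, 512, 16, 16)   # deeper, more semantic features
low = torch.randn(2, 256, 32, 32)    # shallower, more detailed features
enhanced = DetailEnhance(512)(high, low)
print(enhanced.shape)  # torch.Size([2, 256, 32, 32])
```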
- Conference Article
4
- 10.1109/icra48506.2021.9561797
- May 30, 2021
This paper tackles the problem of referring image segmentation, which aims to reason about the region of interest referred to by a query natural language sentence. One key issue in referring image segmentation is how to establish a cross-modal representation that encodes the two modalities, namely the query sentence and the input image. Most existing methods either concatenate the features from each modality or gradually encode the cross-modal representation with respect to each word's effect. In contrast, our approach leverages the correlation between the two modalities to construct the cross-modal representation. To make the resulting cross-modal representation more discriminative for the segmentation task, we propose a novel language-driven attention mechanism that encodes the cross-modal representation by reflecting the attention between every single visual element and the entire query sentence. The proposed mechanism, named Language-Driven Attention (LDA), first decouples the cross-modal correlation into channel attention and spatial attention and then integrates the two attentions to obtain the cross-modal representation. The channel attention and the spatial attention respectively reveal how sensitive each channel or each pixel of a particular feature map is with respect to the query sentence. With a proper fusion of the two kinds of feature attention, the proposed LDA model effectively guides the generation of the final cross-modal representation. The resulting representation is further strengthened to capture multi-receptive-field and multi-level semantics for the intended segmentation. We assess our referring image segmentation model on four public benchmark datasets, and the experimental results show that our model achieves state-of-the-art performance.
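The decoupling into channel attention and spatial attention driven by the sentence embedding can be sketched as follows (a hypothetical module, not the released LDA code):

```python
# Sketch of language-driven attention: the sentence embedding produces (a) per-channel gates
# saying which feature channels matter for this query, and (b) a spatial map saying which
# pixels correlate with the query; the two are then combined to re-weight the visual features.
import torch
import torch.nn as nn

class LanguageDrivenAttention(nn.Module):
    def __init__(self, vis_ch, lang_dim):
        super().__init__()
        self.channel_fc = nn.Linear(lang_dim, vis_ch)   # sentence -> per-channel gates
        self.lang_proj = nn.Linear(lang_dim, vis_ch)    # sentence -> visual space for spatial correlation

    def forward(self, vis, sent):
        # vis: (B, C, H, W), sent: (B, D)
        chan = torch.sigmoid(self.channel_fc(sent))[:, :, None, None]        # (B, C, 1, 1) channel attention
        query = self.lang_proj(sent)[:, :, None, None]                       # (B, C, 1, 1)
        spatial = torch.sigmoid((vis * query).sum(dim=1, keepdim=True))      # (B, 1, H, W) spatial attention
        return vis * chan * spatial                                          # integrated cross-modal features

vis, sent = torch.randn(2, 256, 26, 26), torch.randn(2, 768)
out = LanguageDrivenAttention(256, 768)(vis, sent)
print(out.shape)  # torch.Size([2, 256, 26, 26])
```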
- Research Article
22
- 10.1016/j.neucom.2024.127599
- Mar 24, 2024
- Neurocomputing
A survey of methods for addressing the challenges of referring image segmentation
- Research Article
2
- 10.1007/s11063-024-11487-2
- Feb 22, 2024
- Neural Processing Letters
Referring image segmentation aims to segment an object in an image based on a referring expression; its difficulty lies in aligning expression semantics with visual instances. Existing methods based on semantic reasoning are limited by the performance of an external syntax parser and do not explicitly explore the relationships between visual instances. This article proposes an end-to-end method for referring image segmentation that aligns the 'linguistic relationship' with 'visual relationships' without relying on an external syntax parser for expression parsing. The expression is adaptively and structurally parsed into three components, 'subject', 'object', and 'linguistic relationship', by a Semantic Component Parser (SCP) in a learnable manner. An Instances Activation Map Module (IAM) locates multiple visual instances based on the subject and object. In addition, the Relationship-Based Visual Localization Module (RBVL) first enables each instance in the image to learn global knowledge, then decodes the visual relationships between these instances, and finally aligns the visual relationships with the linguistic relationship to further accurately locate the target object. Experimental results show that the proposed method improves performance by 4–9% over the baseline method on multiple referring image segmentation datasets.
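The learnable parsing of an expression into subject, object, and relationship components can be pictured as attention pooling with three learned queries. This is one plausible reading of such a parser (names are assumptions, not the authors' code):

```python
# Sketch of a learnable semantic component parser: three learned query vectors attend over
# word features and pool them into subject, object, and relationship embeddings.
import torch
import torch.nn as nn

class SemanticComponentParser(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(3, dim) * 0.02)  # subject / object / relationship

    def forward(self, words):
        # words: (B, T, C) contextual word features
        attn = torch.softmax(torch.einsum("kc,btc->bkt", self.queries, words), dim=-1)  # (B, 3, T)
        components = torch.einsum("bkt,btc->bkc", attn, words)                          # (B, 3, C)
        subject, obj, relation = components.unbind(dim=1)
        return subject, obj, relation

words = torch.randn(2, 12, 256)  # e.g. features of "the man holding the red umbrella"
subject, obj, relation = SemanticComponentParser(256)(words)
print(subject.shape, obj.shape, relation.shape)  # three tensors of torch.Size([2, 256])
```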
- Research Article
15
- 10.1016/j.neucom.2023.03.011
- Mar 17, 2023
- Neurocomputing
Cross-modal transformer with language query for referring image segmentation
- Research Article
- 10.1145/3777472
- Nov 19, 2025
- ACM Transactions on Multimedia Computing, Communications, and Applications
Recently, Referring Image Segmentation (RIS) frameworks that pair the Multimodal Large Language Model (MLLM) with the Segment Anything Model (SAM) have achieved impressive results. However, adapting MLLMs to segmentation is computationally intensive, primarily due to visual token redundancy. We observe that traditional patch-wise visual projectors struggle to strike a balance between reducing the number of visual tokens and preserving semantic clarity, often retaining overly long token sequences to avoid performance drops. Inspired by text tokenizers, we propose a novel semantic visual projector that leverages semantic superpixels generated by SAM to identify “visual words” in an image. By compressing and projecting semantic superpixels as visual tokens, our approach adaptively shortens the token sequence according to scene complexity while minimizing semantic loss in compression. To mitigate information loss, we propose a semantic superpixel positional embedding to strengthen the MLLM's awareness of superpixel geometry and position, alongside a semantic superpixel aggregator that preserves both fine-grained details inside superpixels and global context outside them. Experiments show that our method cuts visual tokens by ~93% without compromising performance, notably speeding up MLLM training and inference, and outperforming existing compressive visual projectors on RIS.
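At its core, the superpixel-to-token compression amounts to masked pooling of patch features within each superpixel. A minimal sketch (shapes are hypothetical and the superpixel labels are random stand-ins; no actual SAM call is made):

```python
# Sketch of compressing patch features into superpixel ("visual word") tokens:
# every patch is assigned to a superpixel, and patch features belonging to the same
# superpixel are average-pooled into one visual token.
import torch

def superpixels_to_tokens(patch_feats, sp_labels, num_superpixels):
    """patch_feats: (N, C) per-patch features; sp_labels: (N,) superpixel id per patch."""
    C = patch_feats.size(1)
    tokens = torch.zeros(num_superpixels, C)
    counts = torch.zeros(num_superpixels, 1)
    tokens.index_add_(0, sp_labels, patch_feats)                      # sum features per superpixel
    counts.index_add_(0, sp_labels, torch.ones(len(sp_labels), 1))    # count patches per superpixel
    return tokens / counts.clamp(min=1)                               # mean-pool -> one token per superpixel

patch_feats = torch.randn(576, 1024)        # e.g. 24x24 patches from a vision encoder
sp_labels = torch.randint(0, 40, (576,))    # stand-in superpixel assignment (40 "visual words")
tokens = superpixels_to_tokens(patch_feats, sp_labels, num_superpixels=40)
print(tokens.shape)  # torch.Size([40, 1024]) -- far fewer tokens than the 576 patches
```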
- Conference Article
23
- 10.1145/3474085.3475222
- Oct 17, 2021
Referring Image Segmentation (RIS) aims to segment the target object in an image referred to by a given natural language expression. The diverse, flexible expressions and complex visual contents in the images place high demands on RIS models to capture fine-grained matching behaviors between words in expressions and objects presented in images. However, such matching behaviors are hard to learn and capture when the visual cues of the referents (i.e., referred objects) are insufficient, as referents with weak visual cues tend to be confused with the cluttered background at boundaries or even overwhelmed by salient objects in the image. This insufficient-visual-cues issue cannot be handled by the cross-modal fusion mechanisms of previous work. In this paper, we tackle this problem from the novel perspective of enhancing the visual information of the referents by devising a Two-stage Visual cues enhancement Network (TV-Net), in which a novel Retrieval and Enrichment Scheme (RES) and an Adaptive Multi-resolution feature Fusion (AMF) module are proposed. Specifically, RES retrieves the most relevant image from an external data pool with regard to both visual and textual similarities, and then enriches the visual information of the referent with the retrieved image for better multimodal feature learning. AMF further enhances detailed visual information by incorporating high-resolution feature maps from lower convolution layers of the image. Through this two-stage enhancement, the proposed TV-Net learns fine-grained matching behaviors between the natural language expression and the image, especially when the visual information of the referent is inadequate, and thus produces better segmentation results. Extensive experiments validate the effectiveness of the proposed method on the RIS task, with TV-Net surpassing state-of-the-art approaches on four benchmark datasets.
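The multi-resolution part of such a design can be sketched as gated upsample-and-add: a coarse multimodal feature map is upsampled and combined with a high-resolution low-level feature map through a learned gate, so fine visual detail is injected only where it helps. This is an illustrative stand-in, not the TV-Net code.

```python
# Sketch of adaptive multi-resolution feature fusion (hypothetical module names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    def __init__(self, coarse_ch, fine_ch, out_ch):
        super().__init__()
        self.proj_coarse = nn.Conv2d(coarse_ch, out_ch, 1)
        self.proj_fine = nn.Conv2d(fine_ch, out_ch, 1)
        self.gate = nn.Conv2d(2 * out_ch, out_ch, 1)   # decides how much detail to let through

    def forward(self, coarse, fine):
        coarse = F.interpolate(self.proj_coarse(coarse), size=fine.shape[-2:],
                               mode="bilinear", align_corners=False)
        fine = self.proj_fine(fine)
        g = torch.sigmoid(self.gate(torch.cat([coarse, fine], dim=1)))
        return coarse + g * fine

coarse = torch.randn(2, 512, 20, 20)   # fused vision-language features
fine = torch.randn(2, 256, 80, 80)     # high-resolution features from an early conv layer
out = AdaptiveFusion(512, 256, 256)(coarse, fine)
print(out.shape)  # torch.Size([2, 256, 80, 80])
```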
- Book Chapter
3
- 10.1007/978-3-030-60633-6_3
- Jan 1, 2020
Referring image segmentation aims to segment the entity referred to by a natural language description. Previous methods tackle this problem by conducting multimodal feature interaction between the image and either words or the sentence alone. However, considering only single-granularity feature interaction tends to result in an incomplete understanding of the visual and linguistic information. To overcome this limitation, we propose to conduct multi-granularity multimodal feature interaction by introducing a Word-Granularity Feature Modulation (WGFM) module and a Sentence-Granularity Context Extraction (SGCE) module, which are complementary in feature alignment and obtain a comprehensive understanding of the input image and referring expression. Extensive experiments show that our method outperforms previous methods and achieves new state-of-the-art performance on four popular datasets, i.e., UNC (+1.45%), UNC+ (+1.63%), G-Ref (+0.47%), and ReferIt (+1.02%).
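The two granularities of interaction can be sketched side by side: word-level modulation attends from pixels to individual words, while sentence-level context injects one global sentence vector. Module names mirror the abstract, but the implementation below is an illustrative assumption, not the chapter's code.

```python
# Sketch of multi-granularity multimodal interaction (hypothetical realization).
import torch
import torch.nn as nn

class WordGranularityModulation(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis, words):
        # vis: (B, HW, C); words: (B, T, C) -- each pixel gathers the words relevant to it
        out, _ = self.attn(vis, words, words)
        return vis + out

class SentenceGranularityContext(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, vis, sent):
        # vis: (B, HW, C); sent: (B, C) -- one global context vector gates every pixel
        return vis * torch.sigmoid(self.gate(sent)).unsqueeze(1)

vis, words, sent = torch.randn(2, 196, 256), torch.randn(2, 10, 256), torch.randn(2, 256)
vis = WordGranularityModulation(256)(vis, words)   # fine-grained, word-level interaction
vis = SentenceGranularityContext(256)(vis, sent)   # complementary sentence-level context
print(vis.shape)  # torch.Size([2, 196, 256])
```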