We propose a concise and consistent network for multi-task learning of Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES) within visual grounding (VG). To simplify the model architecture and enable parameter sharing, we reformulate visual grounding as a floating-point coordinate generation problem conditioned on both image and text inputs. Rather than separately predicting bounding boxes and pixel-level segmentation masks, we represent both uniformly as sequences of coordinate tokens, autoregressively outputting the two corner points of the bounding box and the vertices of the segmentation polygon. To improve the accuracy of point generation, we introduce a regression-based decoder. Inspired by bilinear interpolation, this decoder directly predicts precise floating-point coordinates, avoiding quantization errors. Additionally, we devise a Multi-Modal Interaction Fusion (M2IF) module to address the imbalance between visual and language features in the model. This module focuses visual information on regions relevant to the textual description while suppressing the influence of irrelevant areas. With this design, visual grounding is realized through a single unified network. Experiments on five benchmark datasets (RefCOCO, RefCOCO+, RefCOCOg, ReferItGame, and Flickr30K Entities) demonstrate that the proposed unified network outperforms or is on par with many existing task-customized models. Code is available at https://github.com/LFUSST/MMI-VG.
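To make the "regression-based decoder" idea concrete, the following is a minimal sketch of one plausible reading: instead of selecting a discrete coordinate bin (which introduces quantization error), a head predicts a soft distribution over bins and returns the softmax-weighted expectation of the bin centers, interpolating between neighboring bins much like bilinear interpolation. The class name, hidden size, and bin count are illustrative assumptions, not the paper's actual implementation (see the repository linked above for that).

```python
# Hypothetical sketch: a coordinate head that outputs continuous values by
# interpolating between bin centers with a softmax-weighted expectation,
# rather than taking an argmax over a quantized coordinate vocabulary.
import torch
import torch.nn as nn

class SoftCoordinateHead(nn.Module):
    def __init__(self, hidden_dim: int = 256, num_bins: int = 64):
        super().__init__()
        self.logits = nn.Linear(hidden_dim, num_bins)         # scores over coordinate bins
        centers = (torch.arange(num_bins) + 0.5) / num_bins   # normalized bin centers in (0, 1)
        self.register_buffer("centers", centers)

    def forward(self, token_state: torch.Tensor) -> torch.Tensor:
        # token_state: (batch, hidden_dim) decoder state for one coordinate token
        weights = self.logits(token_state).softmax(dim=-1)    # (batch, num_bins)
        return weights @ self.centers                         # (batch,) floating-point coordinate

# Usage: one such prediction per autoregressively generated coordinate token
# (box corners and polygon vertices alike).
head = SoftCoordinateHead()
coord = head(torch.randn(2, 256))  # continuous values strictly inside (0, 1)
```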