Abstract

Referring image segmentation has advanced rapidly, benefiting from the outstanding performance of deep neural networks. However, most existing methods explore either local details or the global context of the scene without sufficiently modelling the coordination between them, leading to sub-optimal results. In this paper, we propose a transformer-based method that enforces in-depth coordination between short- and long-range dependencies in both explicit and implicit fusion processes. Specifically, we design a Cross Modality Transformer (CMT) module with two successive blocks for explicitly integrating linguistic and visual features; it first locates the relevant visual region from a global view and then concentrates on local patterns. In addition, a Hybrid Transformer Architecture (HTA) is employed as the feature extractor in the encoding stage to capture global relationships while retaining local cues, and it further aggregates the multi-modal features in an implicit manner. In the decoding stage, a Cross-level Information Integration module (CI2) gathers information from adjacent levels through dual top-down paths: a guided filtration path and a residual reservation path. Experimental results show that the proposed method outperforms state-of-the-art methods on four RIS benchmarks.
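
Note: the abstract gives no implementation details, but it describes the CMT module as two successive blocks that explicitly fuse linguistic and visual features, first from a global view and then on local patterns. Below is a minimal, hypothetical sketch of such a two-block cross-modal fusion using standard multi-head cross-attention in PyTorch; all class, parameter, and variable names (CrossModalityBlock, CrossModalityTransformer, dim, heads) are illustrative assumptions and not the authors' implementation.

import torch
import torch.nn as nn

class CrossModalityBlock(nn.Module):
    """Hypothetical fusion block: visual tokens attend to linguistic
    tokens via multi-head cross-attention, followed by a feed-forward
    layer, with residual connections and layer normalization."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual, language):
        # visual: (B, N_pixels, dim), language: (B, N_words, dim)
        attended, _ = self.cross_attn(query=visual, key=language, value=language)
        visual = self.norm1(visual + attended)
        visual = self.norm2(visual + self.ffn(visual))
        return visual

class CrossModalityTransformer(nn.Module):
    """Two successive fusion blocks, loosely mirroring the described
    'global view then local patterns' ordering."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.global_block = CrossModalityBlock(dim, heads)
        self.local_block = CrossModalityBlock(dim, heads)

    def forward(self, visual, language):
        visual = self.global_block(visual, language)
        visual = self.local_block(visual, language)
        return visual

In this sketch the global/local distinction is not actually modelled: both blocks attend over all tokens. The described CMT presumably differentiates the two stages, for example by restricting the second block to local patterns around the region located by the first, but the abstract does not specify how.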
