Abstract

Referring image segmentation has advanced rapidly, benefiting from the outstanding performance of deep neural networks. However, most existing methods explore either local details or the global context of the scene without sufficiently modelling the coordination between them, leading to sub-optimal results. In this paper, we propose a transformer-based method that enforces in-depth coordination between short- and long-range dependencies in both explicit and implicit fusion processes. Specifically, we design a Cross Modality Transformer (CMT) module with two successive blocks for explicitly integrating linguistic and visual features; it first locates the relevant visual region from a global view and then concentrates on local patterns. In addition, a Hybrid Transformer Architecture (HTA) is employed as the feature extractor in the encoding stage to capture global relationships while retaining local cues, and it further aggregates the multi-modal features in an implicit manner. In the decoding stage, a Cross-level Information Integration module (CI2) gathers information from adjacent levels through dual top-down paths: a guided filtration path and a residual reservation path. Experimental results show that the proposed method outperforms state-of-the-art methods on four RIS benchmarks.
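
Note: the abstract gives no implementation details, but it describes the CMT module as two successive blocks that explicitly fuse linguistic and visual features, first from a global view and then on local patterns. Below is a minimal, hypothetical sketch of such a two-block cross-modal fusion using standard multi-head cross-attention in PyTorch; all class, parameter, and variable names (CrossModalityBlock, CrossModalityTransformer, dim, heads) are illustrative assumptions and not the authors' implementation.

import torch
import torch.nn as nn

class CrossModalityBlock(nn.Module):
    """Hypothetical fusion block: visual tokens attend to linguistic
    tokens via multi-head cross-attention, followed by a feed-forward
    layer, with residual connections and layer normalization."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual, language):
        # visual: (B, N_pixels, dim), language: (B, N_words, dim)
        attended, _ = self.cross_attn(query=visual, key=language, value=language)
        visual = self.norm1(visual + attended)
        visual = self.norm2(visual + self.ffn(visual))
        return visual

class CrossModalityTransformer(nn.Module):
    """Two successive fusion blocks, loosely mirroring the described
    'global view then local patterns' ordering."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.global_block = CrossModalityBlock(dim, heads)
        self.local_block = CrossModalityBlock(dim, heads)

    def forward(self, visual, language):
        visual = self.global_block(visual, language)
        visual = self.local_block(visual, language)
        return visual

In this sketch the global/local distinction is not actually modelled: both blocks attend over all tokens. The described CMT presumably differentiates the two stages, for example by restricting the second block to local patterns around the region located by the first, but the abstract does not specify how.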
