Abstract

Referring image segmentation aims to segment the target object described by a given language expression. Recent bottom-up fusion networks use language features to highlight the most relevant regions during the visual encoding stage. However, establishing only pixel-to-word relationships is not comprehensive. To alleviate this problem, we propose a mixed-scale cross-modal fusion method that widens the interaction between vision and language. Specifically, at each encoder stage, pyramid pooling is used to augment visual perception and enrich the interaction between visual and linguistic features, thereby highlighting the regions relevant to the expression. Additionally, we employ a simple multi-scale feature fusion module to effectively combine the multi-scale aligned features. Experiments on standard RIS benchmarks demonstrate that the proposed method performs favorably against state-of-the-art approaches. Moreover, experiments with different visual backbones show that the proposed method yields consistent and significant improvements.
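To make the idea of pyramid-pooled cross-modal fusion concrete, the following is a minimal PyTorch sketch, not the paper's implementation: it pools the visual features of one encoder stage at several scales, lets each pooled region attend to the word-level language features, and merges the language-aware maps back at full resolution. All module names, the pooling scales, and the attention configuration are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidCrossModalFusion(nn.Module):
    """Hypothetical sketch of mixed-scale cross-modal fusion for one encoder stage."""

    def __init__(self, vis_dim, lang_dim, pool_sizes=(1, 2, 4, 8), num_heads=4):
        super().__init__()
        self.pool_sizes = pool_sizes
        # Project word features into the visual channel dimension.
        self.lang_proj = nn.Linear(lang_dim, vis_dim)
        # Cross-attention: pooled visual regions query the word features.
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        # Merge the original map and all upsampled, language-aware maps.
        self.out_conv = nn.Conv2d(vis_dim * (len(pool_sizes) + 1), vis_dim, kernel_size=1)

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (B, C, H, W) visual features from one encoder stage
        # lang_feat: (B, L, D) word-level language features
        B, C, H, W = vis_feat.shape
        lang = self.lang_proj(lang_feat)                              # (B, L, C)
        fused = [vis_feat]
        for s in self.pool_sizes:
            pooled = F.adaptive_avg_pool2d(vis_feat, s)               # (B, C, s, s)
            tokens = pooled.flatten(2).transpose(1, 2)                # (B, s*s, C)
            attended, _ = self.attn(tokens, lang, lang)               # region queries, word keys/values
            attended = attended.transpose(1, 2).reshape(B, C, s, s)
            fused.append(F.interpolate(attended, size=(H, W),
                                       mode="bilinear", align_corners=False))
        return self.out_conv(torch.cat(fused, dim=1))                 # (B, C, H, W)


# Example usage with toy shapes.
if __name__ == "__main__":
    module = PyramidCrossModalFusion(vis_dim=256, lang_dim=768)
    v = torch.randn(2, 256, 32, 32)   # stage-level visual features
    l = torch.randn(2, 20, 768)       # 20 word embeddings per expression
    print(module(v, l).shape)         # torch.Size([2, 256, 32, 32])
```

A simple multi-scale fusion head could then upsample and sum (or concatenate) the outputs of this module from each stage before the segmentation decoder, in the spirit of the feature fusion module described above.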
