Mixed-scale cross-modal fusion network for referring image segmentation

Xiong Pan, Xuemei Xie, Jianxiu Yang

doi:10.1016/j.neucom.2024.128793

Journal: Neurocomputing

Publication Date: Oct 26, 2024

Abstract

Referring image segmentation (RIS) aims to segment the target described by a given language expression. Recent bottom-up fusion networks use language features to highlight the most relevant regions during the visual encoding stage. However, establishing only the relationship between pixels and words is not comprehensive. To alleviate this problem, we propose a mixed-scale cross-modal fusion method that widens the interaction between vision and language. Specifically, at each stage, pyramid pooling is used to augment visual perception and improve the interaction between visual and linguistic features, thereby highlighting relevant regions in the visual data. Additionally, we employ a simple multi-scale feature fusion module to combine the multi-scale aligned features. Experiments on standard RIS benchmarks demonstrate that the proposed method performs favorably against state-of-the-art approaches. Moreover, experiments with different visual backbones show that the proposed method significantly improves performance across them.
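The idea of widening pixel-word interaction with pyramid pooling can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the bin sizes (1, 2, 4), the scaled dot-product attention, the nearest-neighbour upsampling, and the additive combination are all assumptions chosen only to make the idea concrete. It pools the visual map at several scales, lets both pixels and pooled regions attend to the word features, and adds the language-aware region features back onto the pixels.

```python
import numpy as np


def adaptive_avg_pool(feat, bins):
    """Average-pool an (H, W, C) feature map down to (bins, bins, C)."""
    h, w, c = feat.shape
    out = np.zeros((bins, bins, c), dtype=feat.dtype)
    for i in range(bins):
        r0 = (i * h) // bins                      # floor of the bin start
        r1 = ((i + 1) * h + bins - 1) // bins     # ceil of the bin end
        for j in range(bins):
            c0 = (j * w) // bins
            c1 = ((j + 1) * w + bins - 1) // bins
            out[i, j] = feat[r0:r1, c0:c1].mean(axis=(0, 1))
    return out


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, words):
    """Visual tokens (N, C) attend to word features (T, C); returns (N, C)."""
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ words.T / scale, axis=-1)  # (N, T)
    return weights @ words


def mixed_scale_fusion(visual, words, bins=(1, 2, 4)):
    """Hypothetical mixed-scale fusion: pixel-word plus region-word interaction.

    visual: (H, W, C) visual features; words: (T, C) language features.
    The bin sizes and the additive combination are illustrative assumptions.
    """
    h, w, c = visual.shape
    pixels = visual.reshape(-1, c)
    # Pixel-level interaction: every pixel attends to every word.
    fused = pixels + cross_attend(pixels, words)
    for b in bins:
        # Region-level interaction: pooled regions attend to the words.
        regions = adaptive_avg_pool(visual, b).reshape(-1, c)
        region_lang = cross_attend(regions, words).reshape(b, b, c)
        # Nearest-neighbour upsampling back to the pixel grid.
        rows = (np.arange(h) * b) // h
        cols = (np.arange(w) * b) // w
        upsampled = region_lang[rows][:, cols]
        fused = fused + upsampled.reshape(-1, c)
    return fused.reshape(h, w, c)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.standard_normal((8, 8, 16))   # toy visual feature map
    words = rng.standard_normal((5, 16))       # toy word embeddings
    print(mixed_scale_fusion(visual, words).shape)
```

In this sketch the coarsest bin (1) gives every pixel a global, language-conditioned context, while finer bins add progressively more local region-word cues. A real model would use learned projections and repeat the fusion at each encoder stage.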

Full Text