Cross-scale sampling transformer for semantic image segmentation

Yizhe Ma, Fangjian Lin, Shengwei Tian, Long Yu

doi:10.3233/jifs-220976

Abstract

As scenes grow more complex, multi-scale information fusion becomes increasingly critical for semantic image segmentation. Various methods have been proposed to model multi-scale information, such as local-to-global fusion, but they are insufficient as scenes become more varied and image resolutions become larger. This paper proposes the Cross-Scale Sampling Transformer. We are the first to propose sparsely sampling each scale feature at one time and fusing all other features, which differs from all previous methods. Specifically, the Channel Information Augmentation module is first proposed to enhance the query features, highlight part of the response to sampling points, and enhance image features. Next, the Multi-Scale Feature Enhancement module performs a one-time fusion of full-scale features, so that each feature obtains information about the features at other scales. In addition, the Cross-Scale Fusion module performs cross-scale fusion of the query feature and the full-scale features. Together, these three modules constitute our Cross-Scale Sampling Transformer (CSSFormer). We evaluate CSSFormer on four challenging semantic segmentation benchmarks, PASCAL Context, ADE20K, COCO-Stuff 10K, and Cityscapes, achieving 59.95%, 55.48%, 50.92%, and 84.72% mIoU, respectively, outperforming the state of the art.
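For readers unfamiliar with the metric, the mIoU (mean intersection over union) figures reported above can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `mean_iou`, the flattened label lists, and the `ignore_index` value of 255 are assumptions made for illustration, and benchmark toolkits typically accumulate the per-class counts over the whole dataset rather than per image.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that occur in the prediction or ground truth.

    pred, target: flat sequences of integer class labels of equal length.
    ignore_index: ground-truth label to skip (hypothetical default).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixel, excluded from the score
        if p == t:
            # Correct pixel: counts once in both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # Wrong pixel: enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    # Average IoU only over classes that actually appear.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


if __name__ == "__main__":
    # Toy example with two classes and four pixels.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.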

Full Text