Abstract

This paper focuses on semantic segmentation of scenes by capturing the appropriate scale, rich detail, and contextual dependencies in a feature representation. Semantic segmentation is a pixel-level classification task that has made steady progress on the basis of fully convolutional networks (FCNs). However, we find there is still room for improvement in the following aspects. First, a pixel by itself does not carry enough information for semantic prediction; it must draw on its surroundings to determine which category it belongs to, yet the fixed receptive field defined by the network is not suitable for every pixel when an image contains objects at various scales. Second, the extracted scale-aware features cannot produce sharp object boundaries because of their low resolution. Third, the network's ability to model long-range dependencies is limited. To address these challenges, we propose three modules: a scale-aware spatial pyramid pooling module, an encoder mask module, and a scale-attention module (SSPP-ES). Extensive experiments on the Cityscapes and ADE20K benchmarks demonstrate the effectiveness of our approach for semantic segmentation.
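The abstract does not specify the internals of the proposed modules, but the scale-aware spatial pyramid pooling idea follows a well-known pattern: parallel branches with different receptive fields, fused into a single multi-scale representation. As a minimal sketch under assumed design choices (ASPP-style dilated branches plus an image-level pooling branch; the class name `ScaleAwareSPP` and all hyperparameters are hypothetical, not taken from the paper), such a module might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAwareSPP(nn.Module):
    """Illustrative multi-scale pyramid pooling (NOT the paper's exact design).

    Parallel atrous convolutions with different dilation rates capture
    context at several receptive-field sizes; an image-level pooling
    branch adds global context; a 1x1 convolution fuses the branches.
    """

    def __init__(self, in_channels, out_channels, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                # Dilation enlarges the receptive field without losing resolution.
                nn.Conv2d(in_channels, out_channels,
                          kernel_size=3 if r > 1 else 1,
                          padding=r if r > 1 else 0,
                          dilation=r, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        # Image-level pooling supplies global context for very large objects.
        self.global_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
        )
        self.project = nn.Conv2d(out_channels * (len(rates) + 1),
                                 out_channels, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        # Upsample the 1x1 global feature back to the spatial grid.
        g = F.interpolate(self.global_pool(x), size=(h, w),
                          mode="bilinear", align_corners=False)
        feats.append(g)
        return self.project(torch.cat(feats, dim=1))
```

In the paper's framing, a scale-attention module would then reweight such branches per pixel rather than fusing them with a fixed 1x1 convolution, letting each pixel select the receptive field appropriate to its object's scale; the sketch above shows only the fixed-fusion baseline.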
