Abstract. Real-time semantic segmentation is widely used in domains such as autonomous driving. Most real-time semantic segmentation networks adopt an encoder-decoder structure. During encoding, feature maps pass through multiple downsampling stages to reduce computational cost and enlarge the receptive field. However, this process may discard detailed information that cannot be fully recovered during upsampling. Moreover, upsampled semantic features may be blurred because they lack spatial detail. To mitigate these issues, we use spatially guided cross-resolution self-attention, which improves the upsampling of semantic features by supplementing them with detailed information from the spatial branch. Furthermore, we incorporate an inductive bias into the cross-resolution attention mechanism to enhance its ability to learn generalized features. Additionally, we design a semantic feature extraction block and a spatial feature extraction branch to construct a lightweight backbone. Results on Cityscapes and CamVid show that the proposed model achieves a good balance between accuracy and efficiency. Specifically, with 1.31M parameters it obtains 74.2 and 71.5 mIoU on the Cityscapes and CamVid test sets, respectively. Code is available at https://github.com/clearwater753/DECENet.
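The general idea of cross-resolution attention can be illustrated with a minimal sketch. It is not the DECENet implementation: the function names, the projection matrices `w_q`, `w_k`, `w_v`, and the flattened token layout are all assumptions made for illustration, and the inductive bias mentioned in the abstract is omitted. The sketch assumes that queries come from high-resolution spatial features and that keys and values come from low-resolution semantic features, so the output has one token per high-resolution position.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a single list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_resolution_attention(spatial, semantic, w_q, w_k, w_v):
    """Hypothetical cross-resolution attention (illustrative assumption).

    spatial:  (H*W) x C high-resolution tokens, used for the queries.
    semantic: (h*w) x C low-resolution tokens, used for keys and values.
    Returns (output, attn): output is (H*W) x C, attn is (H*W) x (h*w).
    """
    q = matmul(spatial, w_q)    # (H*W) x d
    k = matmul(semantic, w_k)   # (h*w) x d
    v = matmul(semantic, w_v)   # (h*w) x C
    scale = math.sqrt(len(w_q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose: d x (h*w)
    scores = matmul(q, k_t)               # (H*W) x (h*w)
    attn = [softmax([s / scale for s in row]) for row in scores]
    return matmul(attn, v), attn


def rand_matrix(rows, cols):
    """Random Gaussian matrix standing in for features or learned weights."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
C, d = 4, 2                      # channel and projection sizes (arbitrary)
spatial = rand_matrix(8 * 8, C)  # 8x8 high-resolution feature map, flattened
semantic = rand_matrix(2 * 2, C) # 2x2 low-resolution feature map, flattened
w_q, w_k, w_v = rand_matrix(C, d), rand_matrix(C, d), rand_matrix(C, C)

upsampled, attn = cross_resolution_attention(spatial, semantic, w_q, w_k, w_v)
```

Under these assumptions, each of the 64 high-resolution positions receives a weighted mix of the 4 low-resolution semantic tokens, with the weights determined by the spatial features at that position. This differs from plain interpolation, where the weights depend only on position.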