Abstract

Convolutional neural networks (CNNs) have local receptive fields and struggle to model long-range spatial dependencies, whereas vision transformers (ViTs) can capture long-range context dependencies but lose local detailed features during the token embedding procedure. Aggregating apparent (low-level) features with context information is therefore helpful for semantic segmentation. In this paper, the roles of low-level apparent features and context information in semantic segmentation are carefully analyzed, and a layer attention module is proposed to finely aggregate the two. First, we design several CNN branches to extract shallow features, such as edges and textures, from the input image; in parallel, a ViT backbone extracts rich context information. Second, we integrate the CNN branches and the ViT into a unified network and propose a layer attention module to fuse the context information with the low-level detailed features. With this unified network, a ViT enhanced with low-level convolutional features, accurate semantic segmentation is achieved. We evaluate our method on the public Cityscapes dataset, where extensive experiments show that it achieves competitive results. Code is available at: https://github.com/cocolord/Degraded_image_segmentation.
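To make the fusion step concrete, the sketch below shows one plausible PyTorch realization of a layer attention module that reweights a stack of aligned feature layers (e.g., CNN edge/texture branches plus projected ViT context features) with learned per-layer weights. This is a minimal sketch under assumed conventions: the class name `LayerAttention`, the (B, L, C, H, W) tensor layout, and the softmax-over-layers weighting are illustrative assumptions, not the exact design from the paper; the released code at the repository above is authoritative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerAttention(nn.Module):
    """Hypothetical layer attention: given L feature maps projected to a
    common shape (CNN shallow-feature branches plus ViT context features),
    learn a weight per layer and fuse them by a weighted sum."""

    def __init__(self, channels: int):
        super().__init__()
        # Score each layer from its globally pooled channel descriptor.
        self.score = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, L, C, H, W) -- L aligned feature layers per image.
        b, l, c, h, w = feats.shape
        desc = feats.mean(dim=(3, 4))             # (B, L, C) global pooling
        attn = self.score(desc).squeeze(-1)       # (B, L) raw layer scores
        attn = F.softmax(attn, dim=1)             # normalize across layers
        # Broadcast the per-layer weights and sum over the layer axis.
        fused = (feats * attn.view(b, l, 1, 1, 1)).sum(dim=1)  # (B, C, H, W)
        return fused


if __name__ == "__main__":
    # Toy usage: fuse 4 aligned layers of 64-channel features.
    module = LayerAttention(channels=64)
    x = torch.randn(2, 4, 64, 32, 32)
    print(module(x).shape)  # torch.Size([2, 64, 32, 32])
```

The softmax over the layer axis lets the network emphasize, per image, either the shallow convolutional cues or the ViT context, which matches the abstract's goal of finely aggregating the two sources rather than simply concatenating them.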
