Abstract

Semantic segmentation is a fundamental task in computer vision that aims to assign a category to every pixel. Currently, most existing methods decode only from the deepest feature map, even though detailed spatial information is inevitably lost during down-sampling. In the decoder, transposed convolution or bilinear interpolation is widely used to restore the encoded feature map to its original size; however, few optimizations are applied during the up-sampling process, which is detrimental to grouping and classification performance. In this work, we propose a dual-pyramid encoder-decoder deep neural network (DPEDNet) to tackle these issues. The first pyramid integrates and encodes multi-resolution features through sequentially stacked merging, and the second pyramid decodes the features through dense atrous convolution with chained up-sampling. Without post-processing or multi-scale testing, the proposed network achieves state-of-the-art performance on two challenging benchmark image datasets covering both ground-view and aerial-view scenes.
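The bilinear up-sampling mentioned above can be sketched as follows. This is a minimal NumPy illustration of how a decoder restores feature-map resolution by interpolation alone (with corner-aligned sampling); it is not the paper's implementation, and the helper name `bilinear_upsample` is ours.

```python
import numpy as np

def bilinear_upsample(x, scale):
    """Up-sample a 2D feature map by bilinear interpolation.

    x: (H, W) feature map; scale: integer up-sampling factor.
    Output coordinates are mapped back to input space so that the
    four corner values are preserved (align-corners style).
    """
    H, W = x.shape
    out_h, out_w = H * scale, W * scale
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # map the output pixel back into input coordinates
            src_i = i * (H - 1) / (out_h - 1) if out_h > 1 else 0.0
            src_j = j * (W - 1) / (out_w - 1) if out_w > 1 else 0.0
            i0, j0 = int(src_i), int(src_j)
            i1, j1 = min(i0 + 1, H - 1), min(j0 + 1, W - 1)
            di, dj = src_i - i0, src_j - j0
            # weighted average of the four surrounding input values
            out[i, j] = (x[i0, j0] * (1 - di) * (1 - dj)
                         + x[i1, j0] * di * (1 - dj)
                         + x[i0, j1] * (1 - di) * dj
                         + x[i1, j1] * di * dj)
    return out
```

Because no parameters are learned here, this step cannot adapt to the data, which is exactly the limitation the abstract points out.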

Highlights

  • Semantic image segmentation is a dense classification task for image understanding, which has many practical applications such as autonomous driving and augmented reality devices

  • Following the common procedure of semantic segmentation, we reported the precision, recall and mean Intersection over Union (IoU)

  • In Figure 2, the visualization results show that our proposed DPEDNet can accurately detect and segment objects at various scales, in complicated scenes, and under very challenging illumination conditions



Introduction

Semantic image segmentation is a dense classification task for image understanding, with many practical applications such as autonomous driving and augmented reality devices. FCN-based architectures (Ronneberger et al., 2015; Badrinarayanan et al., 2017; Treml et al., 2016; Jiang et al., 2019; Jiang et al., 2020) utilize several pooling layers to extract high-level features and restore the extracted feature map to the original resolution through transposed convolution. Atrous convolution (Holschneider et al., 1990) with various dilation rates is applied in parallel to extract multi-scale features. Although this kind of pyramid structure is effective for multi-scale feature extraction and can enhance the ability to classify and group ambiguous objects, it captures contextual information only from the deepest feature map, by attaching a context module after the encoding stage. We hold the view that the contextual information in the early and middle stages can be further exploited to enhance feature extraction.
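The parallel atrous branches described above can be sketched in NumPy. This is a minimal illustration of dilated convolution and of combining branches with different dilation rates (in the spirit of an ASPP-style context module); it is not the paper's network, and the helper names `atrous_conv2d` and `aspp_like` are ours.

```python
import numpy as np

def atrous_conv2d(x, kernel, rate):
    """Valid-mode 2D atrous (dilated) convolution.

    x: (H, W) feature map; kernel: (k, k); rate: dilation rate.
    The effective kernel extent per side is (k - 1) * rate + 1, so a
    larger rate sees a wider context with the same number of weights.
    """
    k = kernel.shape[0]
    eff = (k - 1) * rate + 1           # effective receptive extent
    H, W = x.shape
    out = np.zeros((H - eff + 1, W - eff + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # sample the input with gaps of size `rate`
            patch = x[i:i + eff:rate, j:j + eff:rate]
            out[i, j] = np.sum(patch * kernel)
    return out

def aspp_like(x, kernel, rates=(1, 2, 4)):
    """Run parallel atrous branches with different dilation rates,
    crop to the smallest common size, and sum (a simple stand-in for
    the concatenate-and-project fusion used in real networks)."""
    branches = [atrous_conv2d(x, kernel, r) for r in rates]
    h = min(b.shape[0] for b in branches)
    w = min(b.shape[1] for b in branches)
    return sum(b[:h, :w] for b in branches)
```

Each branch shares the same kernel size but covers a different receptive field, which is why such a pyramid helps with objects at multiple scales; note that all branches here read from the same (deepest) input map, illustrating the limitation discussed above.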

