Abstract
The main challenge in scene parsing arises when complex scenes with highly diverse objects are encountered. The objects differ not only in scale and appearance but also in semantics. Previous works focus on encoding multi-scale contextual information (via pooling or atrous convolutions), generally on top of compact high-level features (i.e., at a single stage). In this work, we argue that a rich set of cues exists at multiple stages of the network, encapsulating low-, mid- and high-level scene details. Therefore, an optimal scene parsing model must aggregate multi-scale context at all three levels of the feature hierarchy, a capability that state-of-the-art scene parsing models lack. To address this limitation, we introduce a novel architecture with three new blocks that systematically aggregate low-, mid- and high-tier features. The heart of our approach is a high-level feature aggregation module that augments sparsely connected atrous convolutions with dense local and layer-wise connections to avoid gridding artifacts. In addition, we employ a novel feature pyramid augmentation and semantic refinement unit to generate low- and mid-level features that are mixed with high-level features at the decoder. We extensively evaluate our approach on the large-scale Cityscapes and ADE20K benchmarks. It surpasses many recent models on both datasets, achieving mean intersection-over-union (mIoU) scores of 80.5% on Cityscapes and 44.0% on ADE20K.
Highlights
Given an image, the goal of semantic segmentation is to assign a category label to each pixel [1], [2]
The Atrous Spatial Pyramid Pooling (ASPP) module in DeepLabv2 [8] & v3 [9] applies parallel atrous convolutions with different dilation rates to extract multi-scale context information, but it operates only on a high-level feature representation (a minimal sketch of such a block follows these highlights)
CONTRIBUTIONS We propose an approach with the following main contributions: (a) We propose a feature-pyramid-based augmentation module to efficiently generate refined low-level features that preserve local details. (b) For mid-level multi-scale feature fusion, we propose a semantic refinement unit that combines a diverse set of features from the network encoder. (c) The central component of our model is a high-level context aggregation block
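For reference, the following is a minimal PyTorch sketch of an ASPP-style block as described in the highlight above: parallel atrous convolutions with different dilation rates applied to the same high-level feature map, then concatenated and fused. The dilation rates (6, 12, 18), channel widths, and the `ASPP` class itself are illustrative assumptions, not the exact configuration used in DeepLab or in this paper.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Sketch of an ASPP-style block: parallel atrous (dilated) convolutions
    with different rates, concatenated and projected back to out_ch channels.
    Rates and widths here are assumptions for illustration."""
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        # One 1x1 branch plus one 3x3 atrous branch per dilation rate.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)]
            + [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r,
                         dilation=r, bias=False) for r in rates]
        )
        # Fuse the concatenated multi-scale responses.
        self.project = nn.Conv2d(out_ch * (1 + len(rates)), out_ch,
                                 kernel_size=1, bias=False)

    def forward(self, x):
        # Each branch sees the same high-level feature map; padding=dilation
        # keeps the spatial size unchanged across all branches.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

# Usage: high-level backbone features, e.g. 2048 channels at reduced resolution.
feats = torch.randn(1, 2048, 33, 33)
print(ASPP()(feats).shape)  # torch.Size([1, 256, 33, 33])
```

Because every branch reads the same last-stage feature map, a block like this captures multi-scale context at a single stage only, which is precisely the limitation the paper's three-tier aggregation targets.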
Summary
The goal of semantic segmentation is to assign a category label to each pixel [1], [2]. PSPNet [7] applies pooling operations with different sub-sampling rates, arranged in parallel, to capture context information. Its pyramid pooling module, however, works only on last-convolutional-layer features, which generally lack local scene details. The Atrous Spatial Pyramid Pooling (ASPP) module in DeepLabv2 [8] & v3 [9] applies parallel atrous convolutions with different dilation rates to extract multi-scale context information, but it likewise operates on a high-level feature representation. Our approach is based on the insight that dilated convolution expands the kernel size by interleaving its weights with zeros, which equates to dropping the intermediate activations in the input feature map. To alleviate this problem, we propose to combine the strengths of dilated (sparse) and wider (dense) kernels, which enhances the discriminative power of the network and avoids unfairly neglecting local information, as happens with atrous convolution. Our model achieves 80.5% and 44.0% mIoU on Cityscapes and ADE20K, respectively, outperforming the best results reported in [7], [17]
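The equivalence stated above, that a dilated convolution behaves like a larger kernel whose weights are interleaved with zeros so that intermediate input activations contribute nothing, can be checked directly. The snippet below is a small demonstration of that fact; the tensor shapes and random weights are arbitrary:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 9, 9)      # input feature map
w = torch.randn(1, 1, 3, 3)      # 3x3 kernel

# 3x3 convolution with dilation rate 2 (effective receptive field 5x5).
y_dilated = F.conv2d(x, w, dilation=2)

# Equivalent dense 5x5 kernel: the 3x3 weights interleaved with zeros.
w_zeros = torch.zeros(1, 1, 5, 5)
w_zeros[:, :, ::2, ::2] = w
y_dense = F.conv2d(x, w_zeros)

# Identical outputs: the zero positions drop every intermediate activation,
# which is the source of gridding artifacts.
print(torch.allclose(y_dilated, y_dense))  # True
```

A parallel dense (non-dilated) kernel covers exactly the positions the zero-interleaved kernel skips, which is the intuition behind the paper's proposal to mix sparse and dense kernels.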