Abstract

We propose Depth-to-Space Net (DTS-Net), an effective technique for semantic segmentation using the efficient sub-pixel convolutional neural network. This technique is inspired by depth-to-space (DTS) image reconstruction, originally used for image and video super-resolution, combined with a mask enhancement filtration technique based on multi-label classification, namely, Nearest Label Filtration. In the proposed technique, we employ depth-wise separable convolution-based architectures. We propose both a deep network, DTS-Net, and a lightweight network, DTS-Net-Lite, for real-time semantic segmentation; these networks employ the Xception and MobileNetV2 architectures as feature extractors, respectively. In addition, we explore the joint semantic segmentation and depth estimation task and demonstrate that the proposed technique can efficiently perform both tasks simultaneously, outperforming state-of-the-art (SOTA) methods. We train and evaluate the proposed method on the PASCAL VOC2012, NYUV2, and CITYSCAPES benchmarks, obtaining high mean intersection over union (mIOU) and mean pixel accuracy (Pix.acc.) values with simple and lightweight convolutional neural network architectures. Notably, the proposed method outperforms SOTA methods that depend on encoder–decoder architectures, although our implementation and computations are far simpler.

Highlights

Received: 2 December 2021

  • Semantic segmentation is an important task in computer vision, as it constitutes pixel-wise classification of an image to mask each object in the scene

  • We propose the Depth-to-Space Net (DTS-Net) deep network for semantic segmentation using a sub-pixel convolutional neural network (CNN), to address the complexity problem of encoder–decoder architectures, with or without attention methods, while retaining high segmentation accuracy

  • We report noteworthy results indicating the high accuracy of both our semantic segmentation approach and the joint task with depth estimation


Introduction

Semantic segmentation is an important task in computer vision, as it constitutes pixel-wise classification of an image to mask each object in the scene. Recent studies on semantic segmentation have achieved promising results using encoder–decoder convolutional neural network (CNN) architectures. In such architectures, the semantic segmentation task is modeled in two stages: the encoding stage, in which the image is down-sampled to obtain its deep semantic features, and the decoding stage, in which the semantic features are up-sampled to obtain a semantic segmentation mask of the same size as the input image. Although encoder–decoder architectures can achieve highly accurate segmentation results, the decoder stage adds considerable computational complexity to the overall model, and the input image size of such models is usually small.
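The depth-to-space operation that DTS-Net builds on rearranges channel information into spatial resolution, so a low-resolution feature map with many channels becomes a high-resolution map with few channels, replacing a learned decoder. A minimal NumPy sketch of this rearrangement follows; the function name and channel layout (channels-last, DCR ordering) are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def depth_to_space(x, r):
    """Rearrange a (H, W, C*r*r) feature map into a (H*r, W*r, C) map.

    Each group of r*r channels at a spatial location is scattered into
    an r-by-r block of output pixels, trading depth for resolution.
    """
    h, w, c = x.shape
    assert c % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c // (r * r)
    # Split channels into the (r, r) sub-pixel offsets plus output channels.
    x = x.reshape(h, w, r, r, c_out)
    # Interleave the sub-pixel offsets with the spatial axes: (h, r, w, r, c_out).
    x = x.transpose(0, 2, 1, 3, 4)
    # Merge each (h, r) and (w, r) pair into the upscaled spatial axes.
    return x.reshape(h * r, w * r, c_out)

# A single pixel with 4 channels becomes a 2x2 block of 1-channel pixels.
x = np.arange(4, dtype=float).reshape(1, 1, 4)
y = depth_to_space(x, 2)
print(y[:, :, 0])  # [[0. 1.]
                   #  [2. 3.]]
```

For semantic segmentation, the final convolution would produce `num_classes * r * r` channels at the encoder's low resolution, and this rearrangement would recover a full-resolution class-score map in a single cheap step.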

