Abstract

Benefiting from the booming of deep learning, the state-of-the-art models achieved great progress. But they are huge in terms of parameters and floating point operations, which makes it hard to apply them to real-time applications. In this paper, we propose a novel deep neural network architecture, named MPDNet, for fast and efficient semantic segmentation under resource constraints. First, we use a light-weight classification model pretrained on ImageNet as the encoder. Second, we use a cost-effective upsampling datapath to restore prediction resolution and convert features for classification into features for segmentation. Finally, we propose to use a multi-path decoder to extract different types of features, which are not ideal to process inside only one convolutional neural network. The experimental results of our model outperform other models aiming at real-time semantic segmentation on Cityscapes. Based on our proposed MPDNet, we achieve 76.7% mean IoU on Cityscapes test set with only 118.84GFLOPs and achieves 37.6 Hz on 768 × 1536 images on a standard GPU.

Highlights

  • The purpose of semantic segmentation is to predict the category label of each pixel in an image, which has always been a basic problem in computer vision

  • Semantic segmentation models need to be accurate and should be efficient in order to apply them to real-time applications

  • We choose a light-weight pretrained on ImageNet classification model as our backbone in order to realize real-time inference

Read more

Summary

Introduction

The purpose of semantic segmentation is to predict the category label of each pixel in an image, which has always been a basic problem in computer vision. With the deepening of research, the performance of semantic segmentation models has been greatly improved This has promoted the development of many practical applications such as autonomous driving [1], medical image analysis and virtual reality [2]. State of the art semantic segmentation models modified the downsampling layers in their backbones. This makes the feature maps output by backbones usually 1/8 of the original image size. Such models need huge time and GPU memory during training and inference. We choose a light-weight pretrained on ImageNet classification model as our backbone in order to realize real-time inference

Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.