SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation.

Vijay Badrinarayanan, Alex Kendall, Roberto Cipolla

doi:10.1109/tpami.2016.2644615

Open Access

https://doi.org/10.1109/tpami.2016.2644615

Abstract

We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network and a corresponding decoder network, followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network [1]. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN [2] and also with the well known DeepLab-LargeFOV [3] and DeconvNet [4] architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures and can be trained end-to-end using stochastic gradient descent. We also performed a controlled benchmark of SegNet and other architectures on both road scenes and SUN RGB-D indoor scene segmentation tasks. These quantitative assessments show that SegNet provides good performance with competitive inference time and the most efficient inference memory-wise as compared to other architectures.
We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet.
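
The index-based upsampling described in the abstract can be illustrated with a minimal sketch. This is not the authors' Caffe implementation: it is plain Python on a single 2-D feature map, and the function names `max_pool_with_indices` and `unpool_with_indices` are hypothetical, chosen only for illustration. It assumes non-overlapping windows and map dimensions divisible by the window size, and it omits the trainable convolution that would follow to densify the sparse map.

```python
def max_pool_with_indices(x, k=2):
    """Non-overlapping k x k max pooling on a 2-D list of numbers.

    Returns the pooled map and, for every pooled cell, the (row, col)
    position in x where the maximum was found.
    """
    h, w = len(x), len(x[0])
    if h % k or w % k:
        raise ValueError("sketch assumes dimensions divisible by k")
    pooled, indices = [], []
    for i in range(0, h, k):
        pooled_row, index_row = [], []
        for j in range(0, w, k):
            # All (row, col) positions inside this k x k window.
            window = [(r, c) for r in range(i, i + k) for c in range(j, j + k)]
            r, c = max(window, key=lambda rc: x[rc[0]][rc[1]])
            pooled_row.append(x[r][c])
            index_row.append((r, c))
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def unpool_with_indices(pooled, indices, out_shape):
    """Place each pooled value back at its remembered position.

    Every other position stays zero, so the result is sparse; nothing
    is learned in this step.
    """
    h, w = out_shape
    out = [[0] * w for _ in range(h)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


feature_map = [
    [1, 2, 5, 0],
    [3, 4, 0, 0],
    [0, 6, 0, 0],
    [0, 0, 0, 7],
]
pooled, indices = max_pool_with_indices(feature_map)
sparse = unpool_with_indices(pooled, indices, (4, 4))
```

Here `pooled` is the low resolution map an encoder stage would pass on, and `sparse` is the upsampled map a decoder stage would then convolve with trainable filters. Only the stored positions are needed for upsampling, not the full encoder feature map.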

Highlights

Semantic segmentation has a wide array of applications, ranging from scene understanding and inferring support-relationships among objects to autonomous driving.
From an overall efficiency viewpoint, we feel less attention has been paid to smaller and more memory- and time-efficient models for real-time applications such as road scene understanding and augmented reality (AR). This was the primary motivation behind the proposal of SegNet, which is significantly smaller and faster than other competing architectures, but which we have shown to be efficient for tasks such as road scene understanding.
The metrics we used to benchmark various deep segmentation architectures, such as the boundary F1-measure (BF), were chosen to complement the existing metrics, which are more biased towards region accuracies.

Summary

Introduction

Semantic segmentation has a wide array of applications, ranging from scene understanding and inferring support-relationships among objects to autonomous driving. There is active interest in semantic pixel-wise labelling [2], [3], [4], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]. The results, although very encouraging, appear coarse [3]. This is primarily because max pooling and sub-sampling reduce feature map resolution. Our motivation to design SegNet arises from this need to map low resolution features to input resolution for pixel-wise classification. This mapping must produce features which are useful for accurate boundary localization.

Objectives

Methods

Findings

Discussion

Conclusion