3SP-Net: Semantic Segmentation Network with Stereo Image Pairs for Urban Scene Parsing

Lingli Zhou, Haofeng Zhang

doi:10.1007/978-3-319-97304-3_39

Abstract

Image semantic segmentation has a wide range of applications, especially in scene parsing. Over the past several years, Convolutional Neural Networks (CNNs) have shown great potential for semantic segmentation of scenes. However, because of the invariance built into CNNs, structural details that are important for fine-grained segmentation results are lost during the feature extraction stage. Rather than trying to restore this hard-to-recover structural information, we propose a novel method, the Semantic Segmentation Network with Stereo Image Pairs (3SP-Net), which uses paired stereo images to generate segmentation results and adds a Generative Adversarial Network (GAN) to make the generated maps more similar to the real ones. 3SP-Net computes a pair of left and right stereo features, which provide additional information about the 3D structure of the physical environment. This information compensates for the structural information lost in 2D images and improves segmentation performance. Furthermore, we adopt adversarial training to enhance the high-order consistency between the maps generated by the segmentation network and the ground-truth segmentation maps. Experiments on Cityscapes show that adding depth features greatly improves the performance of widely used architectures such as the Fully Convolutional Network (FCN) and DeepLab.
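
The adversarial training mentioned in the abstract can be illustrated with a generic objective of the kind commonly used for adversarial segmentation. This is only a sketch under stated assumptions, not necessarily the loss used in 3SP-Net. All symbols are introduced here for illustration: s is the segmentation network with parameters \theta_s, d is the discriminator with parameters \theta_d (it outputs the probability that a label map is ground truth), x_n^L and x_n^R are the left and right images of the n-th stereo pair, y_n is the ground-truth map, \ell_ce is the per-pixel cross-entropy, and \lambda is a weighting factor.

```latex
% Assumed, generic adversarial segmentation objective (not taken from the paper).
\min_{\theta_s} \max_{\theta_d} \;
\sum_{n=1}^{N} \Big[
  \ell_{\mathrm{ce}}\big(s(x_n^{L}, x_n^{R}),\, y_n\big)
  + \lambda \Big(
      \log d\big(x_n^{L}, y_n\big)
      + \log\big(1 - d\big(x_n^{L}, s(x_n^{L}, x_n^{R})\big)\big)
    \Big)
\Big]
```

Under this formulation the discriminator is trained to tell ground-truth maps from predicted ones, while the segmentation network is trained both to match the labels pixel by pixel and to produce maps the discriminator cannot distinguish from ground truth. The second term is what would encourage the high-order consistency the abstract refers to.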

Full Text