Abstract

Semantic segmentation has been an active field in the computer vision and photogrammetry communities for over a decade. Pixel-level semantic labeling of images is generally achieved by assigning labels to pixels using machine learning techniques. Among others, encoder–decoder convolutional neural networks (CNNs) have recently become the baseline approach for this problem. The majority of papers on this topic use only RGB images as input, despite the availability of other data sources, such as depth, which can improve segmentation and labeling. In this chapter, we investigate a number of encoder–decoder CNN architectures for semantic labeling, where the depth data is fused with the RGB data using three different approaches: (1) fusion with the RGB image through color space transformation, (2) stacking depth images and RGB images, and (3) using Siamese network structures, such as FuseNet or VNet. The chapter also presents our approach to using surface normals in place of depth data. The advantage of the surface normal representation is viewpoint independence: the direction of a surface normal vector remains the same when the camera pose changes. This is a clear advantage over raw depth data, where the depth value of a single scene point changes when the camera moves. The chapter provides a comprehensive analysis of the three fusion approaches using the SegNet, FuseNet, and VNet deep learning architectures. The analysis is conducted on both the Stanford 2D-3D-Semantics indoor dataset and aerial images from the ISPRS Vaihingen dataset. The depth images of the Stanford dataset are acquired directly by flash LiDAR, whereas the depth images of the ISPRS dataset are generated by dense 3D reconstruction. We show that the surface normal representation generalizes better to different scenes. In our experiments, using surface normals with FuseNet achieved a 5% improvement over using depth, resulting in 81.5% global accuracy on the Stanford dataset.
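
To illustrate two of the ideas mentioned above (stacking depth with RGB, and replacing depth with a surface normal representation), the following is a minimal sketch rather than the chapter's actual pipeline. It assumes NumPy arrays and approximates normals from image gradients of the depth map, ignoring camera intrinsics; the function names are chosen here for illustration only.

```python
import numpy as np

def depth_to_normals(depth: np.ndarray) -> np.ndarray:
    """Approximate per-pixel surface normals from an (H, W) depth map.

    Simplified illustration: normals are derived from depth gradients only,
    without projecting through the camera intrinsics.
    """
    dz_dy, dz_dx = np.gradient(depth.astype(np.float32))
    # Normal of the local tangent plane: (-dz/dx, -dz/dy, 1), then normalize.
    normals = np.dstack((-dz_dx, -dz_dy, np.ones_like(depth, dtype=np.float32)))
    norm = np.linalg.norm(normals, axis=2, keepdims=True)
    return normals / np.clip(norm, 1e-6, None)

def stack_rgbd(rgb: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Fusion approach (2): stack the depth channel onto RGB as a 4-channel input."""
    return np.dstack((rgb.astype(np.float32), depth.astype(np.float32)))
```

In this sketch, the 3-channel normal image produced by `depth_to_normals` can be fed to the second branch of a Siamese network (approach 3) in place of the raw depth channel.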
