Multi-Scale Convolutional Features Network for Semantic Segmentation in Indoor Scenes

Yanran Wang, Shilang Chen, Qingliang Chen, Junjun Wu

doi:10.1109/access.2020.2993570

Abstract

Semantic segmentation is one of the most fundamental techniques for visual intelligence and plays a vital role in indoor service robotic tasks such as scene understanding, autonomous navigation and dexterous manipulation. However, semantic segmentation of indoor environments poses great challenges for existing techniques because of complex overlaps, heavy occlusions and cluttered scenes with objects of different shapes and scales, which may lead to the loss of edge information and insufficient segmentation accuracy. Moreover, most semantic segmentation networks are very complex and cannot be deployed on mobile robot platforms. It is therefore important to keep the number of network parameters as small as possible while improving the detection of meaningful edges in indoor scenes. In this paper, we present a novel indoor scene semantic segmentation method that refines the segmentation edges and balances accuracy against model complexity for indoor service robots. Our approach systematically incorporates dilated convolution and rich convolutional features from the intermediate layers of Convolutional Neural Networks (CNN), based on two motivations: (1) The middle hidden layers of a CNN contain a lot of potentially useful information for better edge detection that is no longer present in the later layers of traditional structures. (2) Dilated convolution can change the size of the receptive field and obtain multi-scale feature information without losing resolution or introducing any additional parameters. We thus propose a new end-to-end Multi-Scale Convolutional Features (MSCF) network that integrates dilated convolution with the rich convolutional features extracted from the intermediate layers of a traditional CNN. Finally, the resulting approach is extensively evaluated on the widely used indoor image datasets SUN RGB-D and NYUDv2, and shows promising improvements over state-of-the-art baselines, both qualitatively and quantitatively.
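
Motivation (2) can be checked with simple arithmetic. The sketch below is illustrative only and is not taken from the MSCF implementation: the function names and the 3x3 kernel, 64-channel, 64-pixel settings are assumptions chosen for the example. It uses the standard convolution formulas to show that increasing the dilation rate widens the span covered by a kernel while the weight count and, with suitable padding, the output resolution stay unchanged.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span (in input pixels) covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def conv_weight_count(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a 2D convolution (bias excluded). Dilation is not an input:
    it only spreads the same weights further apart."""
    return kernel_size * kernel_size * in_channels * out_channels


def output_size(input_size: int, kernel_size: int, dilation: int,
                padding: int, stride: int = 1) -> int:
    """Standard output-length formula for a convolution along one axis."""
    span = effective_kernel_size(kernel_size, dilation)
    return (input_size + 2 * padding - span) // stride + 1


if __name__ == "__main__":
    # Hypothetical settings: 3x3 kernel, 64 -> 64 channels, 64-pixel input side.
    for dilation in (1, 2, 4):
        padding = dilation  # for a 3x3 kernel at stride 1 this preserves size
        print(
            f"dilation={dilation}",
            f"span={effective_kernel_size(3, dilation)}",
            f"output={output_size(64, 3, dilation, padding)}",
            f"weights={conv_weight_count(3, 64, 64)}",
        )
```

In this example the span grows from 3 to 5 to 9 pixels as the dilation rate goes from 1 to 2 to 4, while the output side stays at 64 and the weight count stays at 36864.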

Highlights

Semantic segmentation is one of the most essential techniques driving visual intelligence; it classifies each pixel of an image according to the semantic content that pixel expresses.
This paper aims to address this challenge by proposing a new Multi-Scale Convolutional Features neural network for semantic segmentation on indoor service robots, which may encounter typical scenes such as the one shown in Figure 1, containing a variety of unstructured and occluded objects with different contours and sizes.
Our framework is based on the pre-trained VGG-16 [8] or ResNet-101 [9] model, and we introduce a new module for extracting useful features from the middle layers of Convolutional Neural Networks (CNN), inspired by the Richer Convolutional Features (RCF) [10] model for edge detection.

Summary

Introduction

Semantic segmentation is one of the most essential techniques driving visual intelligence; it classifies each pixel of an image according to the semantic content that pixel expresses. It is the most difficult and fundamental task in current scene understanding. Great progress has been made based on powerful CNN structures [1]–[6]. These CNN-based models have pushed semantic segmentation to a new level compared with traditional approaches. However, unstructured targets, irregular shapes and object occlusion in images from complex real environments still pose great challenges to existing semantic segmentation approaches, greatly restricting their applicability.
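
The per-pixel classification that defines semantic segmentation can be illustrated with a small sketch. It assumes a network has already produced one score per class for every pixel and simply picks the highest-scoring class at each location. The function name `label_map` and the toy scores are hypothetical and are not part of the paper.

```python
from typing import List


def label_map(scores: List[List[List[float]]]) -> List[List[int]]:
    """Turn per-class score maps, indexed scores[class][row][col], into a
    label map, indexed labels[row][col], by taking the best class per pixel."""
    num_classes = len(scores)
    height = len(scores[0])
    width = len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x])
         for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    # Toy 2x2 image with two hypothetical classes (0 and 1).
    toy_scores = [
        [[0.9, 0.2], [0.4, 0.1]],  # scores for class 0
        [[0.1, 0.8], [0.6, 0.9]],  # scores for class 1
    ]
    print(label_map(toy_scores))  # [[0, 1], [1, 1]]
```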

Objectives

Methods

Findings

Conclusion