Incremental and Multi-Task Learning Strategies for Coarse-To-Fine Semantic Segmentation

Mazen Mel,Umberto Michieli,Pietro Zanuttigh

doi:10.3390/technologies8010001

Abstract

The semantic understanding of a scene is a key problem in the computer vision field. In this work, we address the multi-level semantic segmentation task where a deep neural network is first trained to recognize an initial, coarse, set of a few classes. Then, in an incremental-like approach, it is adapted to segment and label new objects’ categories hierarchically derived from subdividing the classes of the initial set. We propose a set of strategies where the output of coarse classifiers is fed to the architectures performing the finer classification. Furthermore, we investigate the possibility to predict the different levels of semantic understanding together, which also helps achieve higher accuracy. Experimental results on the New York University Depth v2 (NYUDv2) dataset show promising insights on the multi-level scene understanding.

Highlights

The semantic understanding of a scene is a long standing problem in the computer vision field that can be approached at different levels of interpretations
Instances of particular classes are identified by means of a bounding box which surrounds the objects and a label is assigned to each instance
The network consists of a Xception feature extractor, whose weights were pre-trained [28] on the Pascal VOC 2012 dataset [29], and a decoder made by Atrous Spatial Pyramid Pooling (ASPP) layers

Summary

Introduction

The semantic understanding of a scene is a long standing problem in the computer vision field that can be approached at different levels of interpretations. In other settings, a coarse set of classes could be predicted first and the set of classes could hierarchically grow into more refined categories to better understand the semantic context. To visualize this scenario, imagine an indoor navigation system first trained on a very coarse set of labels to segment, e.g. movable objects, permanent structures, and furniture, in order to, e.g., avoid obstacles. The dataset used for the initial training could be refined with a more fine-grained set of semantic classes (e.g., the movable objects class could be split into books, monitor, etc) and the task of the robotic system is to interact with these new types of objects. One solution could be to retrain from scratch the underlying neural network with the new set of classes; some other solutions may seem more reasonable

Objectives

Methods

Results

Conclusion