Abstract

Collecting real-world data for the training of neural networks is enormously time-consuming and expensive. The concept of virtualizing the domain and creating synthetic data has therefore been analyzed in many instances. Virtualization offers many possibilities for changing the domain and thus enables the relatively fast creation of data. Compared with conventional augmentation methods, it also offers the chance to enrich the necessary augmentations with additional semantic information. This raises the question of whether such semantic changes, which can be seen as augmentations of the virtual domain, lead to better results for neural networks trained with data augmented this way. In this paper, a virtual dataset is presented, including semantic augmentations and automatically generated annotations, together with a comparison between semantic and conventional augmentation of image data. The results of neural network models trained with the two augmentation approaches are found to differ only marginally.
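
To make the distinction concrete, here is a minimal, hypothetical sketch (not code from the paper) contrasting the two approaches: conventional augmentation manipulates the already rendered image, while semantic augmentation changes the virtual scene description itself and re-renders. The function render_scene and all scene parameter names are assumed placeholders for whatever simulator is used.

```python
import random
import numpy as np

def conventional_augment(image: np.ndarray, mask: np.ndarray, strength: float):
    """Conventional augmentation: transforms the finished image; the label
    mask is flipped in lockstep so annotations stay pixel-aligned."""
    if random.random() < 0.5:
        image, mask = np.fliplr(image).copy(), np.fliplr(mask).copy()
    jitter = 1.0 + random.uniform(-strength, strength)   # brightness jitter
    image = np.clip(image.astype(np.float32) * jitter, 0, 255).astype(np.uint8)
    return image, mask

def semantic_augment(scene_params: dict, strength: float):
    """Semantic augmentation: perturbs the virtual scene and re-renders, so a
    fresh pixel-perfect annotation comes directly from the simulator."""
    params = dict(scene_params)
    params["sun_elevation_deg"] = (params.get("sun_elevation_deg", 45.0)
                                   + random.uniform(-30.0, 30.0) * strength)
    params["fog_density"] = random.uniform(0.0, 0.3) * strength
    return render_scene(params)  # hypothetical simulator call (placeholder)
```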

Highlights

  • With the rise of convolutional neural networks, computer vision tasks became solvable with deep learning approaches

  • The basic idea is to investigate the impact of semantic augmentation of synthetic image data, compared to conventional augmentation, on the training of an artificial neural network for semantic segmentation

  • Eleven AdapNet models are obtained: two models per augmentation strength category (one per augmentation approach, which implies five strength categories, since 2 × 5 + 1 = 11), plus one model with strong augmentations trained on a combined dataset of conventional and semantic augmentations, as enumerated in the sketch below
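
The breakdown of the eleven runs can be made explicit. A small sketch, assuming five strength categories (the count implied by 2 × 5 + 1 = 11); the category names are hypothetical, not taken from the paper:

```python
# Hypothetical enumeration of the eleven AdapNet training runs: one model per
# augmentation approach for each strength category, plus one model with strong
# augmentations trained on the combined (conventional + semantic) dataset.
STRENGTHS = ["very_weak", "weak", "medium", "strong", "very_strong"]  # assumed
APPROACHES = ["conventional", "semantic"]

configs = [(s, a) for s in STRENGTHS for a in APPROACHES]
configs.append(("strong", "combined"))

assert len(configs) == 11
for strength, approach in configs:
    print(f"train AdapNet: strength={strength}, augmentation={approach}")
```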


Introduction

With the rise of convolutional neural networks, computer vision tasks became solvable with deep learning approaches. Semantic segmentation, a flavor of image classification that performs a per-pixel classification, is used heavily in various areas, including scene understanding for autonomous driving [1,2]. The goal of this per-pixel classification is to gain an understanding of the scene's semantics. Multiple neighboring pixels representing the same class define areas in which one or more objects of the respective class are located. This information can be analyzed further: if the point of view is known, a processor can calculate the 3D location of the object(s) and decide on a course of action. For example, the recognition of the environment based on images provided by a camera attached to the front of a train is investigated.
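
As an illustration of per-pixel classification (a generic sketch, not the paper's pipeline), the network's class scores can be reduced to a label map with an argmax, and a boolean mask then marks the image area covered by one class:

```python
import numpy as np

# Generic per-pixel classification sketch: H x W x C class scores from a
# segmentation network are reduced to one class label per pixel.
H, W, NUM_CLASSES = 4, 6, 3
logits = np.random.rand(H, W, NUM_CLASSES)   # stand-in for network output
label_map = np.argmax(logits, axis=-1)       # per-pixel classification

# Neighboring pixels of the same class form areas where objects of that
# class are located; here we locate all pixels of a hypothetical class 1.
target_mask = (label_map == 1)
ys, xs = np.nonzero(target_mask)
if xs.size:
    print("class-1 region spans x:", xs.min(), "-", xs.max(),
          " y:", ys.min(), "-", ys.max())
```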
