Abstract

Deep neural networks have revolutionized many areas of computer vision, but they require notoriously large amounts of labeled training data. For tasks such as semantic segmentation and monocular 3d scene layout estimation, collecting high-quality training data is extremely laborious because dense, pixel-level ground truth is required and must be annotated by hand. In this paper, we present two techniques for significantly reducing the manual annotation effort involved in collecting large training datasets. The tools are designed to allow rapid annotation of entire videos collected by RGBD cameras, thus generating thousands of ground-truth frames to use for training. First, we propose a fully-automatic approach to produce dense pixel-level semantic segmentation maps. The technique uses noisy evidence from pre-trained object detectors and scene layout estimators and incorporates spatial and temporal context in a conditional random field formulation. Second, we propose a semi-automatic technique for dense annotation of 3d geometry, and in particular, the 3d poses of planes in indoor scenes. This technique requires a human to quickly annotate just a handful of keyframes per video, and then uses the camera poses and geometric reasoning to propagate these labels through an entire video sequence. Experimental results indicate that the technique could be used as an alternative or complementary source of training data, allowing large-scale data to be collected with minimal human effort.
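
To make the second technique more concrete, the following is a minimal sketch (in Python/NumPy) of how a plane annotated in one keyframe could be re-expressed in another frame's coordinates using the known camera motion. It assumes planes are parameterized as n·x = d in camera coordinates and that the relative pose between frames is available from the RGBD camera trajectory; the function and variable names are illustrative assumptions, not the paper's actual implementation.

    import numpy as np

    def propagate_plane(n_key, d_key, R_kt, t_kt):
        # Plane in the keyframe: n_key . x = d_key (camera coordinates).
        # Points map to the target frame via x_t = R_kt @ x_k + t_kt,
        # the relative pose taken from the camera trajectory.
        n_t = R_kt @ n_key           # rotate the plane normal
        d_t = d_key + n_t @ t_kt     # account for the camera translation
        return n_t, d_t

    # Toy usage: a 10-degree rotation about the x-axis and a 0.5 m translation along z.
    theta = np.radians(10.0)
    R_kt = np.array([[1.0, 0.0, 0.0],
                     [0.0, np.cos(theta), -np.sin(theta)],
                     [0.0, np.sin(theta),  np.cos(theta)]])
    t_kt = np.array([0.0, 0.0, 0.5])
    n_t, d_t = propagate_plane(np.array([0.0, -1.0, 0.0]), 1.0, R_kt, t_kt)

Repeating such a transform for every frame between annotated keyframes is one way keyframe plane labels could be carried through a video sequence, in the spirit of the propagation described above.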

Highlights

  • Understanding the semantic, three-dimensional structure of the visual world is a fundamental problem in computer vision, with innumerable applications ranging from automatic photo retrieval to autonomous vehicles

  • We explore how to collect large-scale training data with minimal human interaction for these two tasks: semantic segmentation and 3d room layout from single images

  • Once H_angular is computed, we find the connected components of this matrix using a depth-first search (DFS)
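
The last highlight describes a grouping step that amounts to connected-component labeling on a graph. Below is a minimal, illustrative sketch assuming H_angular is a symmetric boolean matrix whose (i, j) entry marks pairs of surface elements whose orientations agree within an angular threshold; this reading of the matrix, and all names, are assumptions rather than details taken from the paper.

    import numpy as np

    def connected_components(H):
        # Label connected components of an undirected graph given as a
        # boolean adjacency matrix H, using an iterative depth-first search.
        n = H.shape[0]
        labels = np.full(n, -1, dtype=int)   # -1 means "not yet visited"
        current = 0
        for seed in range(n):
            if labels[seed] != -1:
                continue
            stack = [seed]
            labels[seed] = current
            while stack:
                i = stack.pop()
                for j in np.flatnonzero(H[i]):
                    if labels[j] == -1:
                        labels[j] = current
                        stack.append(j)
            current += 1
        return labels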


Summary

Introduction

Understanding the semantic, three-dimensional structure of the visual world is a fundamental problem in computer vision, with innumerable applications ranging from automatic photo retrieval to autonomous vehicles. When a photograph is taken, the projective transformation "converts" a 3d scene into a 2d image, throwing away most explicit cues about the depths of points in the scene. Humans nevertheless recover much of this lost structure using a variety of cues, including perspective, relative object size and position, and shadows, combined with intuition from a lifetime of experience about the world [1]. Encoding this reasoning into an automatic algorithm has been a long-standing goal of computer vision, but has proven difficult: human-level performance requires not just low-level image cues but also higher-level semantic cues, such as identifying objects, reasoning about their typical relationships, and applying the laws of nature.
