Abstract

This paper proposes a novel 3D representation, namely, a latent 3D volume, for joint depth estimation and semantic segmentation. Most previous studies encoded an input scene (typically given as a 2D image) into a set of feature vectors arranged over a 2D plane. However, because the real world is three-dimensional, this 2D arrangement discards one dimension and may limit the capacity of the feature representation. In contrast, we examine the idea of arranging the feature vectors in 3D space rather than on a 2D plane. We refer to this 3D volumetric arrangement as a latent 3D volume. We show that the latent 3D volume is beneficial to the tasks of depth estimation and semantic segmentation because these tasks require an understanding of the 3D structure of the scene. Our network first constructs an initial 3D volume from image features and then generates a latent 3D volume by passing the initial volume through several 3D convolutional layers. We perform depth regression and semantic segmentation by projecting the latent 3D volume onto a 2D plane. The evaluation results show that our method outperforms previous approaches on the NYU Depth v2 dataset.
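The pipeline described in the abstract (lift 2D image features into an initial 3D volume, then collapse the volume back to a 2D plane for depth prediction) can be sketched roughly as follows. This is a minimal NumPy illustration of the general idea only, not the authors' implementation: the lifting-by-replication strategy, the mean-over-channels scoring, and the soft-argmax over hypothesized depth bins are all assumptions made for illustration, and the 3D convolutional layers are omitted.

```python
import numpy as np

def build_initial_volume(feat2d, num_depth_bins):
    """Lift a (C, H, W) 2D feature map into a (C, D, H, W) initial volume
    by replicating it along D hypothesized depth bins (an assumed strategy)."""
    return np.repeat(feat2d[:, None, :, :], num_depth_bins, axis=1)

def project_depth(volume, bin_centers):
    """Collapse a (C, D, H, W) volume onto the 2D image plane as a depth map:
    score each depth bin (here: channel mean, an assumption), take a softmax
    over the depth axis, and return the per-pixel expected bin center."""
    scores = volume.mean(axis=0)                           # (D, H, W)
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    p = e / e.sum(axis=0, keepdims=True)                   # softmax over depth
    return (p * bin_centers[:, None, None]).sum(axis=0)    # (H, W)

feat = np.random.rand(8, 4, 4).astype(np.float32)   # toy image features
vol = build_initial_volume(feat, num_depth_bins=16)  # (8, 16, 4, 4)
bins = np.linspace(0.5, 10.0, 16)                    # hypothesized depth range (m)
depth = project_depth(vol, bins)                     # (4, 4) depth map
```

A semantic-segmentation head would collapse the same latent volume along the depth axis into per-pixel class logits instead of an expected depth, which is what lets the two tasks share the volumetric representation.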

Highlights

  • Semantic 3D scene reconstruction, in which a scene is geometrically and semantically analyzed, is one of the challenging and crucial problems in the field of computer vision

  • Although Kendall et al. [42] demonstrated the best performance in terms of depth estimation error metrics, they did not use the same weights for depth estimation and semantic segmentation, and their segmentation performance was relatively poor compared with other approaches [15,16]

  • We discuss how the proposed latent 3D volume is beneficial to the tasks of depth estimation and semantic segmentation


Summary

Introduction

Semantic 3D scene reconstruction, in which a scene is geometrically and semantically analyzed, is one of the challenging and crucial problems in the field of computer vision. Understanding the surrounding environment is valuable for many applications such as remote sensing, robotics, augmented reality, and human-computer interaction. Previous studies have tackled this problem using a combination of 3D reconstruction and image-based recognition techniques. Estimating depth and semantic labels from 2D images is a crucial step toward precise semantic 3D scene reconstruction. Convolutional Neural Networks (CNNs) have achieved tremendous results in tasks such as depth estimation [5,6,7,8] and semantic segmentation [9,10,11,12]. Standard CNNs apply a set of 2D convolutions to an input image to acquire such information on a 2D image plane.

