Improving Depth Estimation by Embedding Semantic Segmentation: A Hybrid CNN Model.

José E Valdez-Rodríguez,Hiram Calvo,Edgardo Felipe-Riverón,Marco A Moreno-Armendáriz

doi:10.3390/s22041669

José E Valdez-Rodríguez, Hiram Calvo + Show 2 more

Open Access

https://doi.org/10.3390/s22041669

Copy DOI

Journal: Sensors (Basel, Switzerland)	Publication Date: Feb 21, 2022
Citations: 12	License type: CC BY 4.0

Affiliation: Instituto Politécnico Nacional

Abstract

Single image depth estimation works fail to separate foreground elements because they can easily be confounded with the background. To alleviate this problem, we propose the use of a semantic segmentation procedure that adds information to a depth estimator, in this case, a 3D Convolutional Neural Network (CNN)—segmentation is coded as one-hot planes representing categories of objects. We explore 2D and 3D models. Particularly, we propose a hybrid 2D–3D CNN architecture capable of obtaining semantic segmentation and depth estimation at the same time. We tested our procedure on the SYNTHIA-AL dataset and obtained , which is an improvement of 0.14 points (compared with the state of the art of ) by using manual segmentation, and using automatic semantic segmentation, proving that depth estimation is improved when the shape and position of objects in a scene are known.

Highlights

Depth estimation from a single image consists of calculating the distance between the objects in an image to the user’s point of view
We explored 2D, 3D Convolutional Neural Networks (CNN) models, and a hybrid 2D–3D CNN model capable of obtaining semantic segmentation and depth estimation at the same time
We found that helping the CNN models with additional information such as One-Hot Encoded Semantic Segmentation, aids in separating objects, and obtaining a better depth estimation: Knowing the shape and position of objects in the scene, a CNN can estimate their depth distance with greater accuracy

Summary

Introduction

Depth estimation from a single image consists of calculating the distance between the objects in an image to the user’s point of view. This distance is calculated through a pair of images obtained from both eyes (binocular vision) by using the overlap between the field of view of both eyes [1]. Once we identify the objects in the image, we proceed to estimate depth using this information To carry out these objectives, we propose the use of 2D and 3D Convolutional Neural Networks (CNN) trained with a synthetic dataset, containing both semantic segmentation and depth information, as well to explore a hybrid 2D–3D CNN model capable of estimating depth from a single image, while at the same time, segment objects found in it

Methods

Results

Discussion

Conclusion