Abstract
Estimation of the distance of objects present in the surrounding environment from the visual sensor is the classical research problem in Computer Vision and related areas. The traditional approaches that have been used for depth estimation generally used a pair of images captured through a stereo camera to obtain a disparity map through triangulation method. Later on, researchers have proposed and developed deep learning based methods that can generate the depth map using only a RGB image captured through a monocular camera. However, the recent approaches based on deep convolutional neural networks have shown remarkable development with feasible results. The CNN architecture used for depth estimation consists of two components: Encoder and Decoder. The encoder network is responsible for the extraction of dense features from the input RGB image and the decoder network is used for the upsampling of the encoded depth map. In this paper, we first analyze the depth estimation results obtained using three different pretrained CNN models which are used as an encoder in the depth estimation network, then in order to improve the results further, we leverage the ensemble learning technique, in which these depth estimation models are combined to form an ensemble and the average of all the depths predicted by the individual models in the ensemble is considered as the prediction of the ensemble. The results presented in the paper have shown that the proposed scheme of forming the ensemble of individual encoder-decoder based depth estimation models outperforms the existing benchmark monocular depth estimation methods.
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have