Abstract
Recently, unsupervised monocular training methods based on convolutional neural networks have shown surprising progress in improving the accuracy of depth estimation. However, the performance of these methods suffers severely from problematic pixels such as occluded pixels and low-texture pixels. In this paper, we introduce a method that generates a mask from the statistics of error maps to segment the problematic pixels. Unlike conventional methods that use additional segmentation networks to classify problematic pixels, we use a multi-task learning architecture to generate an identical mask, a mean mask, and a variance mask for filtering the problematic pixels. Experimental results show that our proposed method achieves satisfactory performance compared with other related methods on the KITTI dataset. Moreover, we also apply our method to the UAV dataset VisDrone, and the results further indicate the effectiveness of the method in detecting moving objects.
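As a rough illustration of the idea (not the paper's exact formulation), the sketch below shows one way per-image mean and variance statistics of a photometric error map could be turned into masks that filter problematic pixels. The function name `statistical_masks`, the tensor layout, and the threshold factor `k` are assumptions made for this example.

```python
import torch

def statistical_masks(error_map, k=1.0):
    """Sketch: build mean/variance masks from a per-pixel photometric error map.

    error_map: (B, 1, H, W) tensor of per-pixel reprojection errors.
    k: hypothetical threshold factor (not taken from the paper).
    Returns boolean masks that keep pixels whose error is not anomalously large.
    """
    b = error_map.shape[0]
    flat = error_map.view(b, -1)
    mean = flat.mean(dim=1).view(b, 1, 1, 1)
    std = flat.std(dim=1).view(b, 1, 1, 1)

    # Mean mask: keep pixels whose error is below the per-image mean error.
    mean_mask = error_map < mean
    # Variance mask: keep pixels within k standard deviations of the mean.
    var_mask = (error_map - mean).abs() < k * std
    return mean_mask, var_mask
```

Pixels rejected by both masks would then be excluded from the photometric loss, so that occluded or low-texture regions do not dominate training.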
Highlights
Inferring accurate depth information from a single image has potential applications in 3D reconstruction, robotics, scene understanding, etc.
Impressive progress has been made in improving the performance of monocular depth estimation from a single color image by training a deep network.
Since our statistical masks are mainly derived from error maps, we introduce the concept of error vectors.
Summary
Inferring accurate depth information from a single image has potential applications in 3D reconstruction, robotics, scene understanding, etc. Unlike stereo vision methods, which can infer disparity from multiple images captured from different viewpoints, monocular depth estimation is an ill-posed and inherently ambiguous problem [1]. Impressive progress has been made in improving the performance of monocular depth estimation from a single color image by training a deep network. Several self-supervised approaches have been proposed to train monocular depth estimation models using only synchronized stereo pairs [2] or monocular video [3]. Monocular video is an attractive alternative to stereo-based supervision because its training data are more accessible. A pose estimation network is necessary to train the depth estimation model and constitutes the minimum learning framework for monocular training methods. The bottleneck of unsupervised monocular training methods is obvious: if the depth map of the target frame is well estimated, most of the pixels in the target frame will be well matched in the synthesized frame after warping, but a large number of pixels, such as occluded and low-texture pixels, still cannot be matched correctly.
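For context, the sketch below illustrates the standard view-synthesis step that such monocular training methods rely on: the predicted depth and relative pose warp the source frame into the target view, and the per-pixel photometric residual forms the error map in which occluded, low-texture, and moving pixels stand out. The function `photometric_error` and its interface are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def photometric_error(target, source, depth, K, T):
    """Illustrative sketch of view synthesis for unsupervised monocular training.

    target, source: (B, 3, H, W) images; depth: (B, 1, H, W) predicted target depth;
    K: (B, 3, 3) intrinsics; T: (B, 4, 4) relative pose (target -> source).
    Names and shapes are assumptions for illustration, not the paper's interface.
    """
    B, _, H, W = target.shape
    device = target.device

    # Homogeneous pixel grid of the target frame.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(1, 3, -1).expand(B, -1, -1)

    # Back-project to 3D with the predicted depth, then move into the source camera.
    cam = torch.linalg.inv(K) @ pix * depth.view(B, 1, -1)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    src = (T @ cam_h)[:, :3]

    # Project into the source image plane.
    proj = K @ src
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)

    # Normalise coordinates to [-1, 1] for grid_sample and synthesize the target view.
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    warped = F.grid_sample(source, grid, padding_mode="border", align_corners=True)

    # Per-pixel photometric residual: occluded, low-texture, and moving pixels
    # tend to remain poorly matched and show large values here.
    return (target - warped).abs().mean(dim=1, keepdim=True)
```

An error map produced this way is exactly the kind of input the statistical masks described in the abstract are meant to operate on.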