Abstract

Unsupervised learning-based scene perception has recently become an important research direction. Most unsupervised methods for scene perception tasks (e.g., dense depth recovery and ego-motion estimation) train a convolutional network by minimizing the photometric error between images, achieving very impressive results. Since this supervision signal is weaker than ground-truth data, existing unsupervised methods generally perform poorly in accurate pose estimation and high-resolution depth map generation. To this end, we present an architecture based on a convolutional neural network and the Kalman filter for unsupervised learning of accurate ego-motion and high-resolution single-view depth, with as few parameters as possible. Specifically, we first present a pose network (P-CNN) with a decoupled (multi-stream) convolutional architecture that learns an accurate camera pose (translation and rotation vectors) while avoiding the faulty aliasing of multiple camera poses, using a relatively lightweight network. Then, to strengthen the feature correlations between consecutive image pairs, we introduce the Kalman filter into the learning framework to improve the smoothness of the estimated camera pose. Finally, to decode a high-resolution depth map with fewer uneven and unsmooth regions, we adopt a new upsampling module in the encoder-decoder architecture of the depth network (D-CNN). Extensive experiments on the KITTI driving dataset demonstrate that our method predicts noticeably more accurate camera poses and clearer depth maps.
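As a rough illustration of the pose-smoothing idea described above, the sketch below applies a scalar constant-position Kalman filter to one noisy pose component along a trajectory. This is only a minimal stand-in, assuming a simple random-walk motion model; the function name `kalman_smooth` and the noise parameters `q` and `r` are illustrative choices, not the paper's actual learned-filter formulation.

```python
import numpy as np

def kalman_smooth(measurements, q=1e-4, r=1e-2):
    """Smooth a 1-D sequence of noisy pose estimates with a
    constant-position Kalman filter (process noise q, measurement noise r)."""
    x, p = measurements[0], 1.0   # initial state estimate and covariance
    smoothed = []
    for z in measurements:
        p = p + q                 # predict: covariance grows by process noise
        k = p / (p + r)           # Kalman gain
        x = x + k * (z - x)       # update: correct state by weighted residual
        p = (1.0 - k) * p         # update: shrink covariance
        smoothed.append(x)
    return np.array(smoothed)

# Example: a noisy per-frame estimate of one translation component
rng = np.random.default_rng(0)
noisy = 0.5 + rng.normal(0.0, 0.05, size=50)  # hypothetical raw network output
smooth = kalman_smooth(noisy)
```

The same recursion extends component-wise (or with a full state vector and matrices) to the 6-DoF pose; the point is only that filtering trades a small lag for a visibly smoother trajectory.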
