In real scenes, humans can easily infer their position and their distance from other objects using only their eyes. To give robots the same visual ability, this paper presents OnionNet, an unsupervised framework comprising LeafNet and ParachuteNet, for single-view depth prediction and camera pose estimation. In OnionNet, to speed up convergence and to concretize objects despite gradient locality and moving objects in videos, LeafNet adopts two decoders and enhanced upconvolution modules. Meanwhile, to improve robustness to fast camera movement and rotation, ParachuteNet integrates three pose networks that estimate multi-view camera pose parameters, combined with a modified image preprocessing step. Unlike existing methods, single-view depth prediction and camera pose estimation are trained view by view: the view range shrinks gradually from one view to the next, with the outer pixels disappearing in each subsequent view, much like peeling an onion. Moreover, LeafNet is optimized with the pose parameters from each pose network in turn. Experimental results on the KITTI dataset demonstrate the effectiveness of our method: single-view depth prediction outperforms most supervised and unsupervised methods that address the same two subtasks, and pose estimation achieves state-of-the-art performance compared with existing methods under comparable input settings.
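The view-by-view training schedule can be pictured as a sequence of center crops in which each successive view discards a band of outer pixels. The minimal sketch below only illustrates that idea; the function name, number of views, and shrink ratio are hypothetical choices, not values taken from the paper.

```python
import numpy as np

def onion_views(image, num_views=4, shrink=0.1):
    """Generate progressively center-cropped "views" of an image.

    Each view removes a band of outer pixels from the previous one,
    mimicking the onion-peeling schedule described in the abstract.
    num_views and shrink are illustrative, not values from the paper.
    """
    h, w = image.shape[:2]
    views = []
    for k in range(num_views):
        dy = int(h * shrink * k / 2)  # rows peeled from top and bottom
        dx = int(w * shrink * k / 2)  # columns peeled from left and right
        views.append(image[dy:h - dy, dx:w - dx])
    return views

# Example: four views of a dummy KITTI-sized frame (375 x 1242).
frame = np.zeros((375, 1242, 3), dtype=np.uint8)
for v in onion_views(frame):
    print(v.shape)
```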