Abstract

Independent 3D object motion estimation is a fundamental problem in 3D computer vision. Directly segmenting rigid objects and estimating their 3D motion from consecutive video frames is an ill-posed problem. We present a self-supervised framework for segmenting independently moving rigid objects and estimating their motion information, including location, driving direction, and speed, from a monocular video. Specifically, we first estimate depth, optical flow, and camera pose between a pair of video frames, and then synthesize a new 3D viewpoint from this pair. Subsequently, a Motion Recurrent All-Pairs Field Transforms (MRAFT) module is introduced to extract 3D scene flow and a binary motion-area mask from the image pair and depth. A Rigid Object Motion Estimation Module (ROMEM) with a slot attention mechanism is then proposed to extract rigid object motion masks from a multi-layer motion field comprising optical flow, depth changes, refined scene flow, and motion masks. Finally, 2D image and 3D scene reconstruction errors are used to drive self-supervised training of rigid object motion. Experiments on the FlyingThings3D and KITTI datasets show that our method outperforms other advanced algorithms in estimating depth, optical flow, scene flow, and rigid moving object masks.
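The abstract names a slot attention mechanism inside ROMEM for grouping a multi-layer motion field into per-object masks. As a rough illustration only, here is a minimal slot-attention sketch in the style of Locatello et al. (2020); the class name `MotionSlotAttention`, the feature dimension, and the slot/iteration counts are assumptions for illustration, not the paper's actual architecture or hyperparameters.

```python
# Minimal slot-attention sketch (after Locatello et al., 2020).
# Illustrates how a ROMEM-style module could let slots compete for pixels
# of a motion-field feature map; all names and sizes here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionSlotAttention(nn.Module):
    def __init__(self, num_slots=8, dim=64, iters=3, eps=1e-8):
        super().__init__()
        self.num_slots, self.iters, self.eps = num_slots, iters, eps
        self.scale = dim ** -0.5
        # Learned Gaussian initialization for the slots.
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)

    def forward(self, inputs):
        # inputs: (B, N, dim) -- N flattened pixels, each carrying motion-field
        # features (e.g., optical flow, depth change, scene flow, motion mask).
        b, n, d = inputs.shape
        inputs = self.norm_in(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
            b, self.num_slots, d, device=inputs.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # Softmax over the slot axis: slots compete for each pixel.
            attn = F.softmax(torch.einsum('bsd,bnd->bsn', q, k) * self.scale, dim=1)
            attn = attn / (attn.sum(dim=-1, keepdim=True) + self.eps)
            updates = torch.einsum('bsn,bnd->bsd', attn, v)
            slots = self.gru(updates.reshape(-1, d),
                             slots.reshape(-1, d)).view(b, -1, d)
            slots = slots + self.mlp(self.norm_mlp(slots))
        # attn: (B, S, N); reshaping to (B, S, H, W) gives soft per-object masks.
        return slots, attn
```

Because the attention weights are normalized across slots rather than across pixels, each pixel's motion features are explained by (mostly) one slot, which is what makes the attention map usable as a set of rigid-object motion masks.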
