Estimating the 3D motion of independently moving objects is a fundamental problem in 3D computer vision. Directly segmenting rigid objects and estimating their 3D motion from consecutive frames of a monocular video is an ill-posed problem. We present a self-supervised framework that segments independently moving rigid objects in a monocular video and estimates their motion information, including location, driving direction, and speed. Specifically, we first estimate depth, optical flow, and camera pose for a pair of video frames and then synthesize a new 3D viewpoint from this pair. Subsequently, a Motion Recurrent All-Pairs Field Transforms (MRAFT) module is introduced to extract 3D scene flow and a binary motion-area mask from the image pair and their depth maps. Next, a Rigid Object Motion Estimation Module (ROMEM) with a slot-attention mechanism is proposed to extract per-object rigid motion masks from a multi-layer motion field comprising optical flow, depth changes, refined scene flow, and the motion mask. Finally, 2D image and 3D scene reconstruction errors drive the self-supervised training of rigid object motion. Experiments on the FlyingThings3D and KITTI datasets demonstrate that our method outperforms other advanced algorithms in estimating depth, optical flow, scene flow, and rigid moving-object masks, confirming the benefits of our approach.
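To make the pipeline concrete, the sketch below shows one plausible way the stages described in the abstract could be wired together in PyTorch. All module names (`depth_net`, `flow_net`, `pose_net`, and the `mraft`/`romem` wrappers), their signatures, and the loss helper are hypothetical placeholders mirroring the abstract's terminology; this is a minimal sketch under those assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class MotionPipeline(nn.Module):
    """Illustrative wiring of the abstract's stages.

    The five submodules are assumed to be externally defined networks;
    their names and interfaces are hypothetical.
    """

    def __init__(self, depth_net, flow_net, pose_net, mraft, romem):
        super().__init__()
        self.depth_net = depth_net  # monocular depth, one map per frame
        self.flow_net = flow_net    # 2D optical flow between the frame pair
        self.pose_net = pose_net    # relative camera pose (ego-motion)
        self.mraft = mraft          # 3D scene flow + binary motion-area mask
        self.romem = romem          # slot attention over the motion field

    def forward(self, frame_t, frame_t1):
        # Stage 1: per-frame depth, inter-frame optical flow, camera pose.
        depth_t = self.depth_net(frame_t)            # (B, 1, H, W)
        depth_t1 = self.depth_net(frame_t1)          # (B, 1, H, W)
        flow = self.flow_net(frame_t, frame_t1)      # (B, 2, H, W)
        pose = self.pose_net(frame_t, frame_t1)      # e.g. a 6-DoF transform

        # Stage 2: MRAFT extracts 3D scene flow and a binary motion-area
        # mask from the image pair and depth.
        scene_flow, motion_mask = self.mraft(
            frame_t, frame_t1, depth_t, depth_t1
        )  # (B, 3, H, W), (B, 1, H, W)

        # Stage 3: ROMEM groups the multi-layer motion field (optical flow,
        # depth change, refined scene flow, motion mask) into per-object
        # rigid motion masks via slot attention. The raw depth difference
        # below is a crude stand-in; a faithful version would sample
        # depth_t1 at flow-warped pixel locations first.
        depth_change = depth_t1 - depth_t
        motion_field = torch.cat(
            [flow, depth_change, scene_flow, motion_mask], dim=1
        )
        object_masks, object_motions = self.romem(motion_field)

        return object_masks, object_motions, depth_t, flow, pose


def reconstruction_loss(target, reconstruction):
    """Mean absolute photometric error between a frame (or 3D scene) and
    its reconstruction from the neighboring view; the abstract's 2D and
    3D reconstruction errors could both take this generic form."""
    return (target - reconstruction).abs().mean()
```

In this reading, the self-supervision signal comes entirely from `reconstruction_loss`-style terms on warped 2D images and reconstructed 3D scenes, so no motion or mask labels are needed at training time.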