We present a novel framework named NeuralRecon for real-time 3D scene reconstruction from a monocular video. Unlike previous methods that estimate single-view depth maps separately on each key-frame and fuse them later, we propose to directly reconstruct local surfaces, represented as sparse TSDF volumes, for each video fragment sequentially with a neural network. A learning-based TSDF fusion module based on gated recurrent units guides the network to fuse features from previous fragments. This design allows the network to capture the local smoothness prior and the global shape prior of 3D surfaces when sequentially reconstructing them, resulting in accurate, coherent, and real-time surface reconstruction. The fused features can also be used to predict semantic labels, allowing our method to reconstruct and segment the 3D scene simultaneously. Furthermore, we propose an efficient self-supervised fine-tuning scheme that refines scene geometry based on input images through differentiable volume rendering. This fine-tuning scheme improves reconstruction quality on the fine-tuned scenes, as well as generalization to similar test scenes. Experiments on the ScanNet, 7-Scenes, and Replica datasets show that our system outperforms state-of-the-art methods in terms of both accuracy and speed.