Estimating the 6DoF pose of objects in complex scenes is a core challenge in environmental perception for unmanned systems. Recently, mesh-free pose estimation methods based on “inverting” NeRF have achieved state-of-the-art (SOTA) accuracy under ideal data conditions compared with traditional methods. However, the overall performance of this strategy is limited by several overlooked details: NeRF’s sparse pixel sampling during backpropagation can trap optimization in local minima on high-resolution images; random pose initialization hampers network convergence and introduces estimation bias; and the absence of geometric consistency constraints leaves pose estimation fragile in occluded environments. To address these issues, this paper proposes a “coarse-to-fine” NeRF pose prediction framework (C2Fi-NeRF). In the training phase, an affinity-based full-pixel backpropagation strategy replaces the sparse ray sampling of conventional NeRF training: the full gradient map is divided into affinity blocks that are rendered and backpropagated in sequence. This enables efficient full-pixel training on high-resolution images and markedly improves the quality and consistency of rendered images, reducing noise and artifacts. The prediction phase has two parts. The coarse stage optimizes reprojection error via feature-point matching to supply an accurate pose initialization, accelerating NeRF convergence and reducing the potential for bias. The fine stage integrates multi-view geometric and color consistency constraints into the inverse-NeRF iteration, improving pixel-rendering robustness in complex occluded scenes while refining the pose estimate. Experiments show that, compared with existing NeRF-based and mainstream deep learning methods, C2Fi-NeRF is competitive in both accuracy and efficiency on relevant datasets (NeRF-Synthetic, LLFF, Replica, YCB) and is better suited to practical robotic applications.
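To make the training-phase idea concrete, below is a minimal PyTorch sketch of block-wise full-pixel backpropagation: the image is processed in contiguous blocks that are rendered and backpropagated in turn, with gradients accumulated before a single optimizer step. The `render_fn` interface, the flat ray layout, and the fixed `block` size are assumptions for illustration; the paper's actual affinity-block construction is not specified in the abstract.

```python
import torch

def full_pixel_backprop(render_fn, rays, target, block, optimizer):
    """Block-wise full-pixel training sketch: instead of sampling a sparse
    ray subset, render and backpropagate every pixel, one block at a time,
    accumulating gradients so one step covers the whole image."""
    n = rays.shape[0]                       # total pixels (rays) in the image
    optimizer.zero_grad()
    for start in range(0, n, block):
        sl = slice(start, min(start + block, n))
        rgb = render_fn(rays[sl])           # render one affinity block
        # weight each block's mean loss by its pixel share so the accumulated
        # gradient matches the full-image mean photometric loss
        loss = ((rgb - target[sl]) ** 2).mean() * (sl.stop - sl.start) / n
        loss.backward()                     # free this block's graph
    optimizer.step()                        # single update over all pixels
```

Accumulating per-block gradients keeps peak memory bounded by one block's computation graph, which is what makes full-pixel training feasible at high resolution.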
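The coarse stage's reprojection-error optimization can be sketched with standard components. The version below assumes a reference view with known depth, ORB features, and OpenCV's RANSAC PnP solver; the abstract does not name the actual feature extractor or solver, so these are stand-ins.

```python
import cv2
import numpy as np

def coarse_pose(query_img, ref_img, ref_depth, K):
    """Coarse-stage sketch: match features between a query view and a
    reference view with depth, back-project reference matches to 3D, and
    solve PnP to minimize reprojection error for an initial pose."""
    orb = cv2.ORB_create(2000)
    kq, dq = orb.detectAndCompute(query_img, None)
    kr, dr = orb.detectAndCompute(ref_img, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(dq, dr)

    pts3d, pts2d = [], []
    for m in matches:
        u, v = kr[m.trainIdx].pt            # pixel in the reference view
        z = ref_depth[int(v), int(u)]
        if z <= 0:
            continue
        # back-project the reference pixel to a 3D point (camera frame)
        x = (u - K[0, 2]) * z / K[0, 0]
        y = (v - K[1, 2]) * z / K[1, 1]
        pts3d.append([x, y, z])
        pts2d.append(kq[m.queryIdx].pt)     # matched pixel in the query view

    if len(pts3d) < 4:                      # PnP needs at least 4 points
        return False, None, None
    ok, rvec, tvec, _ = cv2.solvePnPRansac(
        np.float32(pts3d), np.float32(pts2d), K.astype(np.float64), None)
    return ok, rvec, tvec                   # initialization for inverse NeRF
```

Seeding inverse NeRF with such a pose, rather than a random one, is what the abstract credits for faster convergence and reduced estimation bias.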
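For the fine stage, one plausible reading is a consistency-weighted photometric objective: pixels whose rendered color disagrees strongly with the observation (e.g., due to occlusion) are downweighted in the pose gradient. The residual threshold `tau` below is a simple stand-in for the paper's multi-view geometry and color consistency test, and `rays_fn` is assumed to generate rays differentiably from the pose estimate.

```python
import torch

def fine_step(render_fn, pose, rays_fn, target, optimizer, tau=0.1):
    """One fine-stage inverse-NeRF iteration sketch: render from the current
    pose and mask out pixels whose color residual exceeds tau, so occluded
    pixels contribute little to the pose update."""
    optimizer.zero_grad()
    rgb = render_fn(rays_fn(pose))                  # differentiable render
    resid = (rgb - target).abs().mean(dim=-1)       # per-pixel color residual
    w = (resid.detach() < tau).float()              # consistency mask (assumed)
    loss = (w * resid ** 2).sum() / w.sum().clamp(min=1.0)
    loss.backward()                                 # gradients flow to pose
    optimizer.step()
    return loss.item()
```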