Abstract

Scene coordinate regression (SCoRe) methods have become prevalent in visual camera localization, yet scenes with repetitive or sparse texture remain a concern: visually similar regions produce ambiguous patterns that degrade localization performance. In this work, we propose a novel network for camera localization from a single RGB image. Our key insight is that a network taking only high-level feature maps as input struggles to model the regression problem accurately because of these ambiguous patterns, and that exploiting the rich spatial details in low-level feature maps can address this issue. The core components of the network are 1) a Deep Feature Aggregation Module (DFAM), which eliminates the differences among feature representations at different levels and fuses multi-level context information; 2) a CoordConv scheme, which further improves the discriminability of features in repetitive or sparsely textured areas of the image; 3) deep supervision, which gives low-level feature maps direct supervision from the ground truth to improve the accuracy of camera localization; and 4) uncertainty modeling, which quantifies the prediction errors stemming from the intrinsic noise in the data. Moreover, to maximize the power of DFAM, we embed channel attention modules into it to prune redundant and noisy features, thereby refining the feature maps at each level. Our network is designed to be lightweight and efficient, and the proposed DFAM can be integrated into general SCoRe-based networks. Comprehensive experiments demonstrate the effectiveness of DFAM and the superiority of our network over state-of-the-art methods on two benchmarks.
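
To illustrate why a CoordConv scheme can help in repetitive or textureless regions, the following minimal sketch appends normalized x and y coordinate channels to a feature map, following the generic CoordConv idea. It is not the network described above: the function name `add_coord_channels`, the `(C, H, W)` layout, the `[-1, 1]` normalization, and the use of NumPy are all assumptions made for illustration, and floating-point features are assumed.

```python
import numpy as np


def add_coord_channels(features):
    """Append normalized x and y coordinate channels to a (C, H, W) array.

    Hypothetical helper for illustration; assumes floating-point features.
    """
    _, h, w = features.shape
    # Coordinates are normalized to [-1, 1] along each axis (an assumption).
    ys = np.linspace(-1.0, 1.0, h, dtype=features.dtype)
    xs = np.linspace(-1.0, 1.0, w, dtype=features.dtype)
    # indexing="ij" yields grids of shape (H, W): rows follow y, columns follow x.
    y_grid, x_grid = np.meshgrid(ys, xs, indexing="ij")
    # Stack the two coordinate maps after the original channels.
    return np.concatenate([features, x_grid[None], y_grid[None]], axis=0)


# A textureless feature map: every pixel has the same feature vector.
features = np.zeros((1, 4, 6), dtype=np.float32)
out = add_coord_channels(features)

print(out.shape)     # channel count grows from 1 to 3
print(out[:, 0, 0])  # feature vector at the top-left pixel
print(out[:, 3, 5])  # feature vector at the bottom-right pixel
```

Before the coordinate channels are added, every pixel in this toy map carries an identical feature vector, so a per-pixel regressor could not tell the locations apart. Afterwards, each pixel's vector also encodes its position, which is the kind of extra cue a CoordConv layer makes available to subsequent convolutions.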
