Visual localization plays a key role in many robot perception systems. Robust visual localization relies on reliable and repeatable local features to establish high-quality point correspondences among images. This letter addresses two limitations of jointly learned detectors and descriptors. First, existing methods use independent structures and loss functions for keypoint detection and description, which makes it difficult to detect keypoints that correspond to discriminative descriptors. Second, most existing approaches treat all triplet samples equally, which limits the learning algorithm's ability to obtain highly discriminative descriptors. In this letter, we propose Task-aligned SuperPoint (TaSP) to mitigate these problems. First, we explicitly align descriptor and detector learning so that distinctive points are more likely to be detected. Second, we introduce a dynamic importance weighting module that computes the weight of each triplet sample from its intrinsic and empirical importance, so that the network focuses on the most informative triplets throughout training. In addition, we mine negative samples in 3D space when forming triplets, which avoids selecting negatives from repetitive structures. State-of-the-art results on a variety of visual localization benchmarks demonstrate the superiority of our method.
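The abstract does not give the weighting formula, so the following is only a minimal sketch of what a dynamically weighted triplet loss could look like in PyTorch. Treating "intrinsic" importance as the current margin violation and "empirical" importance as a running average of each triplet's past loss, as well as all names and hyperparameters (`margin`, `alpha`, `momentum`), are assumptions for illustration, not the authors' method.

```python
import torch

def weighted_triplet_loss(anchor, positive, negative, loss_history,
                          margin=1.0, alpha=0.5, momentum=0.9):
    """Hypothetical sketch of a triplet margin loss with per-sample
    dynamic importance weights; the actual TaSP formulation may differ."""
    # Descriptor distances for each (anchor, positive, negative) triplet.
    d_pos = torch.norm(anchor - positive, dim=1)
    d_neg = torch.norm(anchor - negative, dim=1)
    per_triplet = torch.clamp(d_pos - d_neg + margin, min=0.0)

    # Assumed intrinsic importance: how strongly each triplet
    # violates the margin in the current batch.
    intrinsic = per_triplet.detach()

    # Assumed empirical importance: an exponential moving average
    # tracking how hard each triplet has been over training.
    loss_history.mul_(momentum).add_(intrinsic, alpha=1.0 - momentum)

    # Combine the two terms and normalize so the weights average to 1.
    w = alpha * intrinsic + (1.0 - alpha) * loss_history
    w = w / (w.mean() + 1e-8)

    return (w * per_triplet).mean(), loss_history
```

Under this reading, easy triplets that the network already separates contribute little gradient, while persistently hard ones keep a high weight across epochs, matching the abstract's goal of focusing on the most informative triplets throughout training.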