Abstract

Local feature matching is the task of establishing pixel-wise correspondences between a pair of images. As an integral component of a wide range of computer vision applications (e.g., visual localization), this task has been successfully performed using Transformer-based methods. However, these methods typically extract numerous keypoints from sparsely textured regions to construct a densely connected graph neural network (GNN) for long-range feature aggregation, which inevitably triggers redundant message exchange and hampers the learning process. Furthermore, they employ Transformer encoder layers that treat images as 1D sequences, leaving them unable to extract multiscale local structural information from the images, which is critical for establishing correspondences in image pairs with significant scale shifts. In this study, we develop FMAP, an innovative detector-free approach that enables accurate local feature matching. For the first issue, FMAP employs an anchor points feature aggregation module (APAM) that captures representative keypoints and discards extraneous ones to build a sparsified GNN for compact yet clean message exchange, with the key insight that keypoints containing abundant visual information are distinguishable from their neighbors. For the second issue, FMAP introduces a global–local multiscale perception module (GMPM), which incorporates abundant multiscale local context into the global feature representation by employing multiple depth-wise convolutions with varying kernel sizes, thereby generating discriminative features that are robust to scale shifts. In addition, depth-wise convolutions are used in the feed-forward network of the GMPM to further fuse the global context with the local feature representation. Extensive experiments on several standard benchmarks demonstrate that the proposed FMAP significantly outperforms state-of-the-art methods. On the relative pose estimation task, FMAP surpasses the cutting-edge methods MatchFormer, QuadTree, and TopicFM by 2.27%, 0.58%, and 1.08%, respectively, in terms of AUC@5°. Moreover, FMAP noticeably outperforms the baseline LoFTR by (2.38%, 1.89%, 1.45%) in terms of AUC@(5°, 10°, 20°). Finally, we integrate FMAP into an official visual localization framework and conduct a visual localization experiment; the results show that FMAP exceeds LoFTR by 2.3% in terms of AP.
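
To make the GMPM idea concrete, the following is a minimal PyTorch-style sketch of a block that enriches global self-attention with multiscale local context from depth-wise convolutions and places a depth-wise convolution inside the feed-forward network, as described above. It is an illustrative assumption of the design, not the authors' released code: the kernel sizes (3, 5, 7), the summation-based fusion, the FFN expansion ratio, and the use of standard multi-head self-attention for the global branch are all placeholders.

```python
# Hypothetical sketch (not the authors' code) of the GMPM idea from the abstract:
# global self-attention plus multiscale local context via depth-wise convolutions,
# followed by a feed-forward network that also contains a depth-wise convolution.
import torch
import torch.nn as nn


class MultiscaleLocalPerception(nn.Module):
    """Local context at several scales via parallel depth-wise convolutions."""

    def __init__(self, dim: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # groups=dim makes each convolution depth-wise (one filter per channel).
        self.branches = nn.ModuleList(
            [nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim) for k in kernel_sizes]
        )
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)  # point-wise channel mixing

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        return self.proj(sum(branch(x) for branch in self.branches))


class ConvFeedForward(nn.Module):
    """Feed-forward network with a depth-wise convolution between the point-wise
    layers, echoing the abstract's description of the GMPM FFN."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Conv2d(dim, hidden, kernel_size=1)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        return self.fc2(self.act(self.dwconv(self.fc1(x))))


class GMPMBlock(nn.Module):
    """Fuses global attention with multiscale local perception, then applies the
    convolutional FFN. `dim` must be divisible by `num_heads`."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local = MultiscaleLocalPerception(dim)
        self.ffn = ConvFeedForward(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = self.norm1(x.flatten(2).transpose(1, 2))  # (B, HW, C)
        global_ctx, _ = self.attn(tokens, tokens, tokens, need_weights=False)
        # Inject multiscale local context alongside the global representation.
        x = x + global_ctx.transpose(1, 2).reshape(b, c, h, w) + self.local(x)
        ffn_in = self.norm2(x.flatten(2).transpose(1, 2))
        return x + self.ffn(ffn_in.transpose(1, 2).reshape(b, c, h, w))


if __name__ == "__main__":
    feats = torch.randn(1, 256, 60, 80)  # e.g. a coarse feature map at 1/8 resolution
    print(GMPMBlock(dim=256)(feats).shape)  # torch.Size([1, 256, 60, 80])
```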
