Generating generalizable and discriminative descriptors is crucial for image matching and 3D reconstruction. Most existing solutions concentrate on encoding specific invariances, such as illumination or viewpoint invariance, yet they often struggle to remain robust and to generalize: their limited information capacity prevents them from adapting to diverse and demanding environments. In this paper, we introduce a novel approach that maximizes the informativeness of hidden features to address these challenges. Specifically, we propose the Hierarchical Context-aware Aggregation Network (HCNet), which imposes a hierarchical dense feature constraint in a coarse-to-refinement description manner: a coarse-level descriptor represents the overall content of the image, while a refinement descriptor captures its fine-grained details. Leveraging the strengths of both CNN and Transformer architectures, the hierarchical dense feature constraint encodes local features as well as long-range information to generate dense feature descriptions efficiently. To boost descriptor informativeness and matching accuracy, we introduce the Context-aware Attention Aggregation (CAA) module, which adaptively aggregates features from multiple scales in an efficient coarse-to-refinement manner. Additionally, we design a hierarchical triplet training strategy that accounts for both the variant and invariant properties of hierarchical features, enhancing descriptor informativeness while preserving strong discriminative power. Experiments on two popular feature-matching benchmarks and a challenging long-term visual localization benchmark show that our method significantly improves matching accuracy and outperforms state-of-the-art descriptors. Moreover, our approach generalizes well across a variety of 3D reconstruction scenarios.
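The abstract does not specify how the hierarchical triplet training strategy is formulated. As a minimal sketch of what a two-level (coarse plus refinement) triplet objective could look like, the code below applies a standard margin-based triplet loss independently at each descriptor scale and combines the two terms; all function names, margins, and the weighting scheme are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin):
    """Standard margin-based triplet loss on L2-normalized descriptors."""
    anchor, positive, negative = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    d_pos = (anchor - positive).pow(2).sum(dim=-1)   # squared distance to the matching descriptor
    d_neg = (anchor - negative).pow(2).sum(dim=-1)   # squared distance to a non-matching descriptor
    return F.relu(d_pos - d_neg + margin).mean()

def hierarchical_triplet_loss(coarse_triplet, fine_triplet,
                              margin_coarse=1.0, margin_fine=0.8, fine_weight=0.5):
    """Combine a triplet term at each descriptor level (coarse + refinement).

    coarse_triplet / fine_triplet: (anchor, positive, negative) tensors of
    shape (batch, dim). The margins and weighting here are illustrative
    choices, not values reported by the paper.
    """
    loss_coarse = triplet_loss(*coarse_triplet, margin=margin_coarse)
    loss_fine = triplet_loss(*fine_triplet, margin=margin_fine)
    return loss_coarse + fine_weight * loss_fine

# Toy usage with random descriptors standing in for network outputs.
coarse = tuple(torch.randn(8, 128) for _ in range(3))   # coarse-level descriptors
fine = tuple(torch.randn(8, 256) for _ in range(3))     # refinement-level descriptors
print(hierarchical_triplet_loss(coarse, fine))
```

Supervising both levels in this way reflects the abstract's stated goal: the coarse term encourages globally consistent descriptors, while the refinement term preserves discriminative local detail.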