We propose a novel correlative region-focused transformer for accurate homography estimation, built on a bilevel progressive architecture. Existing methods typically use features from the entire image to establish correlations between a pair of input images, but irrelevant regions often introduce mismatches and outliers. In contrast, our network mitigates the negative impact of irrelevant regions through a bilevel progressive homography estimation architecture. Specifically, in the outer iteration, we progressively estimate the homography matrix at different feature scales; in the inner iteration, we dynamically extract correlative regions and progressively focus on their corresponding features in both inputs. Moreover, we develop a transformer-based quadtree attention mechanism that explicitly captures the correspondence between the input images, localizing and cropping the correlative regions for the next iteration. This progressive training strategy enhances feature consistency and enables precise alignment at comparable inference speed. Extensive qualitative and quantitative experiments show that the proposed method achieves competitive alignment results and reduces the mean average corner error (MACE) on the MS-COCO dataset compared to previous methods, without additional parameter cost.
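For readers unfamiliar with the metric, the sketch below illustrates how a mean average corner error is commonly computed: the four image corners are warped by the estimated and the ground-truth homography, the Euclidean distances between the warped corners are averaged per image pair, and the result is averaged over all pairs. This is a minimal illustration and not the evaluation code of the proposed method; the function names, the corner convention (pixel centers from 0 to width - 1), and the 128 x 128 patch size in the usage note are assumptions made for this example.

```python
import math


def warp_point(H, x, y):
    """Apply a 3x3 homography H (nested lists) to the point (x, y)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w


def average_corner_error(H_est, H_gt, width, height):
    """Average distance between the four corners warped by H_est and by H_gt."""
    # Assumed corner convention: pixel centers of a width x height patch.
    corners = [
        (0.0, 0.0),
        (width - 1.0, 0.0),
        (width - 1.0, height - 1.0),
        (0.0, height - 1.0),
    ]
    total = 0.0
    for x, y in corners:
        xe, ye = warp_point(H_est, x, y)
        xg, yg = warp_point(H_gt, x, y)
        total += math.hypot(xe - xg, ye - yg)
    return total / len(corners)


def mean_average_corner_error(pairs, width, height):
    """Mean of the per-pair average corner error over (H_est, H_gt) pairs."""
    errors = [average_corner_error(H_e, H_g, width, height) for H_e, H_g in pairs]
    return sum(errors) / len(errors)
```

As a usage example under the same assumptions, an estimate that is off by a pure translation of (3, 4) pixels from an identity ground truth yields an average corner error of 5 pixels on a 128 x 128 patch, since every corner is displaced by the same vector.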