Abstract

In recent years, researchers have made significant strides in computer vision by leveraging transformers, achieving remarkable breakthroughs in low-level vision tasks. The ability of transformers to model long-range dependencies surpasses that of Convolutional Neural Networks (CNNs), enabling the extraction of global features and accurate semantic structures. However, single-transformer frameworks lack sensitivity to high-frequency information in images, which leads to blurry reconstructed regions. To address this limitation, this paper proposes a two-branch Dual Frequency Feature Fusion Network (DF3Net) for image inpainting based on a hierarchical atrous transformer (HAT). Specifically, the head of the dual-frequency convolution (DFC) module decouples the feature maps into low- and high-frequency components. The low-frequency component passes through the proposed HAT branch, while the high-frequency component is fed into a gated convolution branch, so that global structural information and local texture details are captured together. The DFC tail then fuses the high- and low-frequency features to produce the reconstructed image. Moreover, features from the high- and low-frequency branches are fused layer-wise within the network, enabling the two branches to learn from each other and ensuring coherence between the image's semantic structure and its texture details. Experimental evaluations on Places2, Paris StreetView, and CelebA-HQ with different mask ratios demonstrate that the proposed method outperforms state-of-the-art methods in the structural accuracy of image inpainting and generates semantically reasonable images with fine texture details.
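
The two-branch layout described above can be sketched in PyTorch as follows. This is a minimal illustration only, not the authors' implementation: it assumes an average-pooling/residual split for the DFC head, uses a plain multi-head self-attention layer as a stand-in for the HAT branch, a basic gated convolution for the high-frequency branch, and a 1x1 convolution as the fusion tail. All module names and signatures are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DFCHead(nn.Module):
    """Decouples a feature map into low- and high-frequency components.

    Assumption: the low-frequency part is approximated by average pooling
    and the high-frequency part by the residual; the paper's DFC head may
    use a different decomposition.
    """
    def __init__(self, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size

    def forward(self, x):
        low = F.avg_pool2d(x, self.kernel_size)
        low_up = F.interpolate(low, size=x.shape[-2:], mode="nearest")
        high = x - low_up  # residual carries edges and fine texture
        return low, high


class GatedConv(nn.Module):
    """Gated convolution: a content path modulated by a learned soft gate."""
    def __init__(self, channels: int):
        super().__init__()
        self.feature = nn.Conv2d(channels, channels, 3, padding=1)
        self.gate = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return torch.sigmoid(self.gate(x)) * F.elu(self.feature(x))


class DualFrequencyBlock(nn.Module):
    """One layer of the two-branch design with layer-wise fusion."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.head = DFCHead()
        # Stand-in for the HAT branch: plain multi-head self-attention.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.gated = GatedConv(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # DFC-style tail

    def forward(self, x):
        low, high = self.head(x)

        # Global branch on the low-frequency component (flattened tokens).
        b, c, h, w = low.shape
        tokens = low.flatten(2).transpose(1, 2)          # (B, H*W, C)
        tokens, _ = self.attn(tokens, tokens, tokens)
        low = tokens.transpose(1, 2).reshape(b, c, h, w)
        low = F.interpolate(low, size=x.shape[-2:], mode="nearest")

        # Local branch on the high-frequency component.
        high = self.gated(high)

        # Layer-wise fusion of the two branches.
        return self.fuse(torch.cat([low, high], dim=1))


if __name__ == "__main__":
    block = DualFrequencyBlock(channels=32)
    out = block(torch.randn(1, 32, 64, 64))
    print(out.shape)  # torch.Size([1, 32, 64, 64])

Stacking such blocks gives each layer access to both global structure (low-frequency, attention path) and local texture (high-frequency, gated-convolution path), which is the intuition behind the layer-wise fusion described in the abstract.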
