Abstract

Image forgery localization identifies tampered regions within an image by extracting distinctive forgery features. Current methods mainly use convolutional neural networks (CNNs) to extract these features. However, the limited receptive field of CNNs emphasizes local features and impedes global modeling of crucial low-level features such as edges and textures, reducing localization precision. Moreover, prior methods use pyramid networks for multi-scale feature extraction but fall short in multi-scale and inter-layer modeling, yielding inadequate multi-scale representations and limiting adaptability to tampered regions of varying sizes. To address these issues, this paper proposes a Transformer-based model that integrates multi-scale and boundary features. The model employs a Pyramid Vision Transformer as the encoder, using self-attention instead of convolution to strengthen global context modeling. Building on this, the model incorporates a multi-scale feature enhancement module that enriches forgery features by applying convolutional layers with varied receptive fields in parallel. Features from different encoder stages are integrated through a cross-stage interaction module, enabling multi-level feature correlation and a stronger feature representation. Furthermore, the model includes a branch guided by forgery boundary information, which focuses precisely on the structure of tampered regions without introducing irrelevant noise. Experiments demonstrate that our model surpasses previous methods in localization accuracy, with F1 and AUC improving by 8.5% and 2.2%, respectively, under the pre-training setting.
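The abstract does not specify the internals of the multi-scale feature enhancement module, so the following is a minimal illustrative sketch only, assuming a common design: parallel convolutional branches with different receptive fields whose outputs are concatenated and fused by a 1x1 convolution. The class name, branch configuration, and residual connection are all assumptions, not the paper's confirmed architecture.

```python
import torch
import torch.nn as nn

class MultiScaleFeatureEnhancement(nn.Module):
    """Hypothetical sketch: enrich forgery features by running parallel
    convolutions with increasing receptive fields, then fusing the branches."""

    def __init__(self, channels: int):
        super().__init__()
        # Assumed parallel branches: 1x1, 3x3, and dilated 3x3 (effective 5x5).
        self.branch1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.branch3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(channels, channels, kernel_size=3,
                                 padding=2, dilation=2)
        # 1x1 convolution fuses the concatenated branch outputs.
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.cat(
            [self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
        # Residual connection (assumed) preserves the original features.
        return self.fuse(out) + x

# Usage: enhance a feature map from one encoder stage.
feat = torch.randn(1, 64, 56, 56)
enhanced = MultiScaleFeatureEnhancement(64)(feat)
print(enhanced.shape)  # torch.Size([1, 64, 56, 56])
```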