AI-Generated Image Detection With Wasserstein Distance Compression and Dynamic Aggregation

  • Abstract
  • Literature Map
  • Similar Papers
Abstract

With the rapid advancement of generative models, detectors for AI-generated images have become an increasingly necessary technology in computer vision, attracting significant attention from researchers. This technology aims to determine whether an image was captured by a natural imaging system (e.g., a digital camera) or produced by advanced AI techniques. Despite the promising performance of recent fake-detection methods, they are typically trained on millions of redundant images with similar characteristics, leading to inefficient training. Furthermore, the performance of existing detectors often deteriorates when the training datasets are imbalanced. To address these challenges, we propose a novel AI-generated image detector based on dynamic aggregation and information compression with the Wasserstein distance. Experimental results show that our proposed method significantly outperforms state-of-the-art models in generalizing across different generative models, with gains of +1.86% average accuracy and +0.14% average precision, while substantially reducing the training time. On imbalanced datasets, our method achieves a +14.46% accuracy improvement, clearly demonstrating its robustness.
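The abstract does not spell out the Wasserstein distance it builds on, so as a hedged illustration (not the paper's actual compression method): for two equal-size 1-D empirical samples, the first-order Wasserstein (earth mover's) distance reduces to the mean absolute difference of the sorted values. The helper name below is hypothetical.

```python
# Hedged sketch: Wasserstein-1 distance between two equal-size 1-D samples.
# For sorted samples x_(1) <= ... <= x_(n) and y_(1) <= ... <= y_(n),
#   W1 = (1/n) * sum_i |x_(i) - y_(i)|.
# This illustrates the metric only; it is NOT the detector described above.

def wasserstein_1d(xs, ys):
    """W1 between two equal-size 1-D empirical distributions."""
    assert len(xs) == len(ys), "equal sample sizes assumed for this closed form"
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# Shifting a sample by a constant c moves its distribution by exactly c:
print(wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))  # 1.0
```

Because the closed form only needs a sort, it is a cheap proxy for comparing feature distributions; libraries such as SciPy provide the general (unequal-size, weighted) version.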

Similar Papers
  • Research Article
  • 10.3390/drones9090659
HSFANet: Hierarchical Scale-Sensitive Feature Aggregation Network for Small Object Detection in UAV Aerial Images
  • Sep 19, 2025
  • Drones
  • Hongxing Zhang + 8 more

Small object detection in aerial images, particularly from Unmanned Aerial Vehicle (UAV) platforms, remains a significant challenge due to limited object resolution, dense scenes, and background interference. Existing small object detectors often fail to make full use of hierarchical features and inevitably introduce noise through hierarchical upsampling operations, while commonly used loss metrics lack sensitivity to scale information; these two issues jointly lead to performance deterioration. To address them, we propose the Hierarchical Scale-Sensitive Feature Aggregation Network (HSFANet), a novel framework that conducts robust cross-layer feature interaction to perceive small objects' position information in hierarchical feature pyramids and forces the model to balance the multi-scale prediction heads for accurate instance localization. HSFANet introduces a Dynamic Position Aggregation (DPA) module to explicitly enhance the object area in both shallow and deep layers, exploiting the complementary salient representations of small objects. Additionally, an efficient Scale-Sensitive Loss (SSL) is proposed to balance the small object detection outputs across hierarchical prediction heads, effectively improving small object detection performance. Extensive experiments on two challenging UAV benchmarks, VisDrone and UAVDT, demonstrate that HSFANet achieves state-of-the-art (SOTA) results, with a 1.3% gain in overall average precision (AP) and a notable 2.2% improvement in AP for small objects on VisDrone. On UAVDT, HSFANet outperforms previous methods by 0.3% in overall AP and 16.7% in small object AP. These results highlight the effectiveness of HSFANet in enhancing small object detection in complex aerial imagery, making it well suited for practical UAV-based applications.

  • Research Article
  • 10.1109/access.2019.2894841
Procedural Learning With Robust Visual Features via Low Rank Prior
  • Jan 1, 2019
  • IEEE Access
  • Haifeng Li + 5 more

To apply a convolutional neural network (CNN) to unseen datasets, a common approach is to fine-tune a model pre-trained on a large dataset rather than training from scratch. How to control the fine-tuning process to obtain the desired properties remains a challenging problem. Our key observation is that the visual features of the pre-trained model carry rich information that can be exploited during training. A natural thought is to employ these features and design a control strategy to improve the transfer learning process. In this paper, a procedural learning framework using the learned low-rank component of the visual features, both in the pre-trained model and during training, is proposed to improve the accuracy and generalizability of the CNN. In this framework, we present an approach to yield independent visualization features (IVFs). We found via robust independent component analysis that the low-rank components of IVFs provide robust features for our framework. We then design a Wasserstein regularization that controls the transportation of the IVF distribution from the pre-trained model to the final model via the Wasserstein distance. Experiments on the CIFAR-10 and CIFAR-100 datasets with a VGG-style CNN show that our method effectively improves classification results and convergence speed. Exploring visual features in this way can also potentially inspire other topics, such as image detection and reinforcement learning.

  • Research Article
  • Cited by 3
  • 10.62713/aic.3498
An Innovative Deep Learning Approach to Spinal Fracture Detection in CT Images.
  • Aug 20, 2024
  • Annali italiani di chirurgia
  • Haiting Wu + 1 more

Spinal fractures, particularly vertebral compression fractures, pose a significant challenge in medical imaging due to their small-scale nature and blurred boundaries in Computed Tomography (CT) scans. However, advanced deep learning models, such as the integration of the You Only Look Once (YOLO) V7 model with Efficient Layer Aggregation Networks (ELAN) and Max-Pooling Convolution (MPConv) architectures, can substantially reduce the loss of small-scale information during computational processing, thus improving detection accuracy. The purpose of this study is to develop an innovative deep learning approach for detecting spinal fractures, particularly vertebral compression fractures, in CT images. We proposed a novel method to precisely identify spinal injury using the YOLO V7 model as a classifier. This model was enhanced by integrating ELAN and MPConv architectures, which were influenced by the Receptive Field Learning and Aggregation (RFLA) small object recognition framework. Standard normalization techniques were utilized to preprocess the CT images. The YOLO V7 model, integrated with ELAN and MPConv architectures, was trained using a dataset containing annotated spinal fractures. Additionally, to mitigate boundary ambiguities in compressive fractures, a Theoretical Receptive Field (TRF) based on Gaussian distribution and an Effective Receptive Field (ERF) were used to capture multi-scale features better. Furthermore, the Wasserstein distance was employed to optimize the model's learning process. A total of 240 CT images from patients diagnosed with spinal fractures were included in this study, sourced from Ningbo No.2 Hospital, ensuring a robust dataset for training the deep learning model. Our method demonstrated superior performance over conventional object detection networks like YOLO V7 and YOLO V3. Specifically, with a dataset of 200 pathological images and 40 normal spinal images, our method achieved a 3% increase in accuracy compared to YOLO V7. 
The proposed method offers an innovative and more effective approach for identifying vertebral compression fractures in CT scans. These promising findings suggest the method's potential for practical clinical applications, highlighting the significance of deep learning in enhancing patient care and treatment in medical imaging. Future research should incorporate cross-validation and independent validation and test sets to assess the model's robustness and generalizability. Additionally, exploring other deep learning models and methods could further enhance detection accuracy and reliability, contributing to the development of more effective diagnostic tools in medical imaging.

  • Research Article
  • Cited by 5
  • 10.3390/fractalfract8110646
Estimation of Fractal Dimension and Detection of Fake Finger-Vein Images for Finger-Vein Recognition
  • Oct 31, 2024
  • Fractal and Fractional
  • Seung Gu Kim + 3 more

With recent advancements in deep learning, spoofing techniques have developed and generative adversarial networks (GANs) have become an emerging threat to finger-vein recognition systems. Previous research has thus generated finger-vein images for training spoof detectors, but these efforts are limited and elaborate fake finger-vein images still cannot be produced. We therefore develop a new densely updated contrastive learning-based self-attention generative adversarial network (DCS-GAN) to create elaborate fake finger-vein images, enabling the training of corresponding spoof detectors. Additionally, we propose an enhanced convolutional network for a next-dimension (ConvNeXt)-Small model with a large kernel attention module as a new spoof detector capable of distinguishing the generated fake finger-vein images. To improve the spoof detection performance of the proposed method, we introduce fractal dimension estimation to analyze the complexity and irregularity of class activation maps from real and fake finger-vein images, enabling the generation of more realistic and sophisticated fake finger-vein images. Experimental results obtained using two open databases showed that the fake images generated by the DCS-GAN exhibited Fréchet inception distances (FID) of 7.601 and 23.351, with Wasserstein distances (WD) of 18.158 and 10.123, respectively, confirming the possibility of spoof attacks against existing state-of-the-art (SOTA) spoof detection frameworks. Furthermore, experiments conducted with the proposed spoof detector yielded average classification error rates of 0.4% and 0.12% on the two aforementioned open databases, respectively, outperforming existing SOTA methods for spoof detection.

  • Research Article
  • 10.54021/seesv5n3-132
A proposed cooperative approach for edge detection based multi-agent system (MAS) using heterogeneous RGB images
  • Dec 31, 2024
  • STUDIES IN ENGINEERING AND EXACT SCIENCES
  • Nedjoua Houda Kholladi + 2 more

Image segmentation has become a growing research area in image processing, computer vision, and pattern recognition. Despite decades of progress and significant accomplishments, challenges remain in feature extraction and model design, making segmentation a complex problem that calls for intelligent systems. This manuscript proposes a cooperative approach to image edge detection based on a Multi-Agent System (MAS); the proposed approach addresses ongoing obstacles of existing image processing methods. A central issue for segmentation methods is extracting features and producing accurately segmented images: edges and regions of the input image are often confusing and intricate to extract because neighboring brightness levels lie close together. Many technologies and approaches have been proposed for the identification, classification, detection, and segmentation of images, yet an optimal solution remains a challenging task in computer vision. Our proposed MAS framework adopts a bottom-up approach to enhance the Quadtree algorithm, leveraging a fusion process facilitated by dynamic aggregation. This cooperative protocol among region and edge agents enables the framework to achieve more accurate and precise segmentation than state-of-the-art methods, even when applied to heterogeneous images from diverse benchmarks. Our approach outperforms the compared state-of-the-art techniques with a PSNR of 19.2107 dB, an MSE of 0.0059, and an SSIM of 0.9602.

  • Research Article
  • Cited by 18
  • 10.3390/electronics13173404
DFS-DETR: Detailed-Feature-Sensitive Detector for Small Object Detection in Aerial Images Using Transformer
  • Aug 27, 2024
  • Electronics
  • Xinyu Cao + 3 more

Object detection in aerial images plays a crucial role across diverse domains such as agriculture, environmental monitoring, and security. Aerial images present several challenges, including dense small objects, intricate backgrounds, and occlusions, necessitating robust detection algorithms. This paper addresses the critical need for accurate and efficient object detection in aerial images using a Transformer-based approach enhanced with specialized methodologies, termed DFS-DETR. The core framework leverages RT-DETR-R18, integrating the Cross Stage Partial Reparam Dilation-wise Residual Module (CSP-RDRM) to optimize feature extraction. Additionally, the introduction of the Detail-Sensitive Pyramid Network (DSPN) enhances sensitivity to local features, complemented by the Dynamic Scale Sequence Feature-Fusion Module (DSSFFM) for comprehensive multi-scale information integration. Moreover, Multi-Attention Add (MAA) is utilized to refine feature processing, which enhances the model’s capacity for understanding and representation by integrating various attention mechanisms. To improve bounding box regression, the model employs MPDIoU with normalized Wasserstein distance, which accelerates convergence. Evaluation across the VisDrone2019, AI-TOD, and NWPU VHR-10 datasets demonstrates significant improvements in the mean average precision (mAP) values: 24.1%, 24.0%, and 65.0%, respectively, surpassing RT-DETR-R18 by 2.3%, 4.8%, and 7.0%, respectively. Furthermore, the proposed method achieves real-time inference speeds. This approach can be deployed on drones to perform real-time ground detection.

  • Research Article
  • Cited by 293
  • 10.1016/j.isprsjprs.2022.06.002
Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark
  • Jun 11, 2022
  • ISPRS Journal of Photogrammetry and Remote Sensing
  • Chang Xu + 5 more

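Several of the papers in this list build on the Normalized Wasserstein Distance (NWD) for tiny objects: each axis-aligned box (cx, cy, w, h) is modeled as a 2-D Gaussian, the squared 2-Wasserstein distance between the two Gaussians has a closed form, and an exponential maps it to a bounded similarity. A minimal sketch under that common formulation; the constant C (here 12.8) is a dataset-dependent assumption, not taken from these abstracts:

```python
import math

def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein Distance between two (cx, cy, w, h) boxes.

    Each box is modeled as the 2-D Gaussian N([cx, cy], diag(w^2/4, h^2/4)).
    For such Gaussians the squared 2-Wasserstein distance has the closed form
        W2^2 = (cx1-cx2)^2 + (cy1-cy2)^2 + ((w1-w2)/2)^2 + ((h1-h2)/2)^2,
    and NWD = exp(-sqrt(W2^2) / C), with C a dataset-dependent constant
    (the default here is an illustrative assumption).
    """
    cx1, cy1, w1, h1 = box_a
    cx2, cy2, w2, h2 = box_b
    w2_sq = ((cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
             + ((w1 - w2) / 2) ** 2 + ((h1 - h2) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)

# Identical boxes have distance 0, hence similarity 1. Unlike IoU, the
# similarity decays smoothly for tiny boxes that no longer overlap.
print(nwd((10, 10, 4, 4), (10, 10, 4, 4)))  # 1.0
```

This smooth decay is why the detectors below swap NWD into their localization losses: IoU drops to zero as soon as tiny boxes stop overlapping, giving no gradient signal.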

  • Research Article
  • Cited by 1
  • 10.3390/s25175310
Defect Detection in GIS X-Ray Images Based on Improved YOLOv10
  • Aug 26, 2025
  • Sensors (Basel, Switzerland)
  • Guoliang Xu + 2 more

Timely and accurate detection of internal defects in Gas-Insulated Switchgear (GIS) with X-ray imaging is critical for power system reliability. However, automated detection faces significant challenges from small, low-contrast defects and complex background structures. This paper proposes an enhanced object-detection model based on the lightweight YOLOv10n framework, specifically optimized for this task. Key improvements include adopting the Normalized Wasserstein Distance (NWD) loss function for small object localization, integrating Monte Carlo (MCAttn) and Parallelized Patch-Aware (PPA) attention to enhance feature extraction, and designing a GFPN-inspired neck for improved multi-scale feature fusion. The model was rigorously evaluated on a custom GIS X-ray dataset. The final model achieved a mean Average Precision (mAP) of 0.674 (IoU 0.5:0.95), representing a 5.0 percentage point improvement over the YOLOv10n baseline and surpassing other comparative models. Qualitative results also confirmed the model’s enhanced capability in detecting challenging small and low-contrast defects. This study presents an effective approach for automated GIS defect detection, with significant potential to enhance power grid maintenance efficiency and safety.

  • Research Article
  • Cited by 9
  • 10.3390/rs17142441
SR-YOLO: Spatial-to-Depth Enhanced Multi-Scale Attention Network for Small Target Detection in UAV Aerial Imagery
  • Jul 14, 2025
  • Remote Sensing
  • Shasha Zhao + 5 more

The detection of aerial imagery captured by Unmanned Aerial Vehicles (UAVs) is widely employed across various domains, including engineering construction, traffic regulation, and precision agriculture. However, aerial images are typically characterized by numerous small targets, significant occlusion, and densely clustered targets, rendering traditional detection algorithms largely ineffective. This work proposes a small target detection algorithm, SR-YOLO, specifically tailored to these challenges in UAV-captured aerial images. First, the Space-to-Depth layer and Receptive Field Attention Convolution are combined into the SR-Conv module, which replaces the Conv module in the original backbone network. This hybrid module extracts more fine-grained small target features by converting spatial image information into depth information and directing the network's attention to targets of different scales. Second, a small target detection layer and a bidirectional feature pyramid network mechanism are introduced to enhance the neck network, strengthening feature extraction and fusion for small targets. Finally, detection performance for small targets is improved by using the Normalized Wasserstein Distance loss function to optimize the Complete Intersection over Union loss function. Empirical results demonstrate that SR-YOLO significantly enhances the precision of small target detection in UAV aerial images. Ablation and comparative experiments are conducted on the VisDrone2019 and RSOD datasets: compared to the baseline YOLOv8s, SR-YOLO improves mAP@0.5 by 6.3% and 3.5% and mAP@0.5:0.95 by 3.8% and 2.3% on VisDrone2019 and RSOD, respectively. It also achieves superior detection results compared to other mainstream target detection methods.

  • Research Article
  • Cited by 11
  • 10.1088/1361-6560/ac2dd1
Unsupervised domain adaptation model for lesion detection in retinal OCT images
  • Oct 22, 2021
  • Physics in Medicine & Biology
  • Jing Wang + 5 more

Background and objective. Optical coherence tomography (OCT) is one of the most used retinal imaging modalities in the clinic as it can provide high-resolution anatomical images. The huge number of OCT images has significantly advanced the development of deep learning methods for automatic lesion detection to ease the doctor’s workload. However, it has been frequently revealed that the deep neural network model has difficulty handling the domain discrepancies, which widely exist in medical images captured from different devices. Many works have been proposed to solve the domain shift issue in deep learning tasks such as disease classification and lesion segmentation, but few works focused on lesion detection, especially for OCT images. Methods. In this work, we proposed a faster-RCNN based, unsupervised domain adaptation model to address the lesion detection task in cross-device retinal OCT images. The domain shift is minimized by reducing the image-level shift and instance-level shift at the same time. We combined a domain classifier with a Wasserstein distance critic to align the shifts at each level. Results. The model was tested on two sets of OCT image data captured from different devices, obtained an average accuracy improvement of more than 8% over the method without domain adaptation, and outperformed other comparable domain adaptation methods. Conclusion. The results demonstrate the proposed model is more effective in reducing the domain shift than advanced methods.

  • Research Article
  • Cited by 54
  • 10.1109/jstars.2022.3221784
PVT-SAR: An Arbitrarily Oriented SAR Ship Detector With Pyramid Vision Transformer
  • Jan 1, 2023
  • IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
  • Yue Zhou + 5 more

The development of deep learning has significantly boosted ship detection in synthetic aperture radar (SAR) images. Most previous works rely on convolutional neural networks (CNNs), which extract characteristics through local receptive fields and are sensitive to noise. Moreover, these detectors have limited performance in large-scale and complex scenes due to the strong interference of inshore background and the variability of target imaging characteristics. In this article, a novel SAR ship detection framework is proposed, which establishes the pyramid vision transformer (PVT) paradigm for multiscale feature representations in SAR images and, hence, is referred to as PVT-SAR. It breaks the limitation of the CNN receptive field and captures global dependence through the self-attention mechanism. Since the difficulties of object detection in SAR and natural images are quite different, directly applying an existing transformer structure, such as PVT-small, cannot achieve satisfactory performance for SAR object detection. Compared with the PVT, overlapping patch embedding and mixed transformer encoder modules are incorporated to overcome the problems of densely arranged targets and insufficient data. Then, a multiscale feature fusion module is designed to further improve the detection ability for small targets. Moreover, a normalized Gaussian Wasserstein distance loss is employed to suppress the influence of scattering interference at the ship's boundary. The superiority of the proposed PVT-SAR detector over several state-of-the-art oriented bounding box detectors has been evaluated in both inshore and offshore scenes on two commonly used SAR ship datasets (i.e., RSSDD and HRSID).

  • Research Article
  • 10.1117/1.jrs.15.046510
Exploiting the sparse characteristics in probabilistic feature space for hyperspectral anomaly detection
  • Dec 28, 2022
  • Journal of Applied Remote Sensing
  • Shaoqi Yu + 3 more

Nowadays, low-rank representation (LRR) and deep learning-based methods have received much attention in anomaly detection for hyperspectral images (HSIs). However, most of these methods focus on the powerful reconstruction capability of neural networks while ignoring the potential probability distribution of both anomalies and background pixels. To address this problem, we propose a sparse component extraction-based probability distribution representation detector (SC-PDRD) framework, which integrates the characteristics of the sparse component obtained by the LRR model with the powerful probability representation ability of the variational autoencoder (VAE) network. The LRR model effectively separates the anomaly component from the background, which also serves as the prior anomaly distribution for each pixel. Moreover, the VAE architecture tries to recover the potential anomaly distribution using the sparse detection map in the feature space. In addition, we employ the Chebyshev neighborhood to leverage spatial information, and a modified Wasserstein distance measures the distance between the test pixel and its neighborhood. The final detection map is attained by combining the prior anomalous degree of the anomalies with the output of the VAE network. Experimental results on three real HSIs demonstrate the effectiveness and superiority of SC-PDRD.

  • Research Article
  • 10.7717/peerj-cs.3470
Refining small object detection in aerial images with PF-DETR: a progressive fusion approach
  • Jan 12, 2026
  • PeerJ Computer Science
  • Jing Liu + 5 more

Small object detection remains a challenging task due to limited pixel resolution, complex backgrounds, and high sensitivity to bounding box variations in aerial images. Although Detection Transformer (DETR)-based methods have made progress, they still face significant limitations in small object detection, primarily due to their reliance on global features, which fail to capture fine-grained details and are sensitive to background noise and bounding box variations. This study proposes Progressive Fusion (PF)-DETR, a model specifically designed to refine small object detection through progressive feature fusion techniques. Central to our approach is the Cross-Scale Feature Fusion with S2 (S2-CCFF) module, which integrates multi-level features with an S2 layer to preserve small object details. Coupled with SPace-to-Depth convolution (SPDConv) downsampling, this module reduces computational cost while maintaining critical information. Additionally, Cross Stage Partial Omni-Kernel Fusion (CSPOK-Fusion) Module achieves progressive fusion by gradually integrating multi-scale features from local, large, and global branches through successive convolutional layers, effectively refining the feature representation at each stage, mitigating background interference and occlusion effects to enhance cross-scale spatial representation. We further introduce a Parallelized Patch-Aware (PPA) attention module in the Backbone network to prioritize small object features, significantly addressing information loss. Finally, Normalized Wasserstein Distance (NWD) loss function is incorporated to heighten robustness against minor localization errors by aligning bounding box positioning and shape, thus boosting detection accuracy. Experimental results on the VisDrone and NWPU VHR-10 datasets revealed that PF-DETR surpasses existing state-of-the-art methods, establishing its effectiveness and adaptability in complex aerial detection tasks.
