Abstract

As the applications of deep neural networks broaden, enhancing their robustness against adversarial attacks becomes increasingly important, and multiple studies have demonstrated their vulnerability to adversarial perturbations. This paper presents an iterative method that employs a probabilistic diffusion model, guided by a self-supervised training strategy, to purify potentially adversarial inputs. Specifically, the diffusion process blends the adversarial noise with incrementally added Gaussian noise, and both types of noise are then removed during the guided denoising process. Unlike existing methods that use a fixed number of iterations and rely on the adversarial input itself for guidance, our scoring-based approach dynamically adjusts the purification duration for each individual image, thereby reducing computational overhead and minimizing the side effects of excessive purification. Additionally, we modify the input at each iteration using a self-supervised strategy and use this modified input to guide the denoising process, yielding better adversarial robustness. Since the proposed method operates without any label information, it can be applied to various training paradigms, including supervised, semi-supervised, self-supervised, and unsupervised learning. We conduct several experiments on widely used machine vision datasets to evaluate the efficacy of the proposed method. The results confirm that our method effectively eliminates adversarial perturbations across these datasets. For example, under a white-box PGD attack within an ℓ∞ ball (ε = 8/255) on CIFAR-10, our method achieves a robust accuracy of 92.13%, surpassing the state of the art by 3.68%.
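To make the described pipeline concrete, the following is a minimal sketch of an iterative, self-supervised guided purification loop with a scoring-based stopping rule, under stated assumptions: `denoiser`, `self_supervised_update`, `purity_score`, and all hyperparameters are hypothetical placeholders for illustration, not the paper's actual components or settings.

```python
# Illustrative sketch (not the authors' implementation) of iterative
# diffusion-based purification with self-supervised guidance and a
# scoring-based, dynamically adjusted stopping criterion.
import torch


@torch.no_grad()
def purify(x_adv, denoiser, self_supervised_update, purity_score,
           noise_scale=0.1, max_iters=50, score_threshold=0.9):
    """Iteratively diffuse and denoise a (possibly adversarial) input."""
    x = x_adv.clone()
    guide = x_adv.clone()  # guidance signal, refreshed at every iteration
    for _ in range(max_iters):
        # Diffusion step: blend any adversarial noise with fresh Gaussian noise.
        x_noisy = x + noise_scale * torch.randn_like(x)

        # Self-supervised strategy modifies the guidance input before denoising.
        guide = self_supervised_update(guide)

        # Guided denoising step removes both the Gaussian and adversarial noise.
        x = denoiser(x_noisy, guide)

        # Scoring-based stopping rule: purify only as long as the input needs
        # (the paper adapts this duration per image), rather than running a
        # fixed number of iterations.
        if purity_score(x) > score_threshold:
            break
    return x
```

Because the loop never uses label information, the purified output can be passed to any downstream classifier, regardless of how that classifier was trained.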