Abstract

In this paper, we propose a new method for detecting adversarial attacks on deep neural networks. Our algorithm builds on the intuition that re-attacking an input image produces displacement vectors of different magnitudes for clean and adversarial inputs: if the input is an adversarial example, the re-attacking process yields a short displacement vector in the feature space, whereas for clean images the displacement is considerable. We train our detector on these displacement vectors. The experimental results show that, compared to current learning-based adversarial detection methods, the proposed system detects adversarial examples with a far simpler network. In addition, the proposed method is independent of the attack type and can detect even novel attacks. We also show that the proposed system learns the discrimination function from a small amount of training data without any hyper-parameter tuning. We obtain remarkable results in detecting adversarial examples located both near and far from the decision boundary, improving on the state of the art in detecting the 2-norm Carlini and Wagner attack (L2-C&W) and the ∞-norm Projected Gradient Descent attack (L∞-PGD), while only the Fast Gradient Sign Method (FGSM) is used to train the system.
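The detection pipeline described above (re-attack the input, measure the resulting feature-space displacement, and classify that displacement with a small network) could be realized roughly as in the following sketch. This is a minimal PyTorch illustration under several assumptions: FGSM is used here as the re-attack step (the abstract only states that FGSM is used to train the detector), and the epsilon value, function names, and two-layer detector architecture are illustrative choices, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm_reattack(model, x, eps=0.03):
    """One FGSM step against the model's current prediction (the 're-attack').

    The true label is unknown at test time, so the currently predicted
    class is attacked instead. (Assumed detail, not from the abstract.)
    """
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)
    pred = logits.argmax(dim=1)
    loss = F.cross_entropy(logits, pred)
    grad, = torch.autograd.grad(loss, x)
    x_adv = (x + eps * grad.sign()).clamp(0.0, 1.0)
    return x_adv.detach()

def displacement_vector(feature_extractor, x, x_reattacked):
    """Feature-space displacement caused by re-attacking the input."""
    with torch.no_grad():
        return feature_extractor(x_reattacked) - feature_extractor(x)

class DisplacementDetector(nn.Module):
    """A deliberately small detector operating on displacement vectors."""
    def __init__(self, feat_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 2),  # clean vs. adversarial
        )

    def forward(self, d):
        return self.net(d)
```

In this sketch, the detector would presumably be trained on displacement vectors computed for clean images and for their FGSM-attacked counterparts, with binary labels indicating whether the original input was adversarial; short displacements would then be indicative of inputs that were already adversarial.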
