Abstract

Deep neural networks (DNNs) have been applied successfully in many fields. However, it has been shown that a well-designed, quasi-imperceptible perturbation can cause a targeted DNN classifier to misclassify an input with high confidence. Inputs carrying such perturbations are called adversarial examples, and detecting them is a challenging task. In this paper, we propose a positive–negative detector (PNDetector) for adversarial examples. The PNDetector is built on a positive–negative classifier (PNClassifier), which is trained on both the original examples (called positive representations) and their negative representations, which share the same structural and semantic features. The principle behind the PNDetector is that, under the PNClassifier, the positive and negative representations of an adversarial example are highly likely to fall into different categories, whereas adding negative representations to the training set does not degrade the classifier's performance on clean examples. We evaluate the PNDetector against adversarial examples generated by eight typical attack methods on four typical datasets. The experimental results demonstrate that the proposed detector is effective on all datasets and against all attack types, and that its detection performance is comparable to that of state-of-the-art methods.
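
To make the detection rule concrete, the following is a minimal Python sketch of the decision described above. It assumes the negative representation is the pixel-wise complement (1 - x) of an input normalized to [0, 1], and that `pn_classifier` is a hypothetical trained PNClassifier exposing a `predict` method; neither detail is fixed by the abstract itself.

```python
import numpy as np

def is_adversarial(x: np.ndarray, pn_classifier) -> bool:
    """Flag x as adversarial if the PNClassifier assigns different labels
    to its positive and negative representations.

    Assumptions (not specified in the abstract): inputs are normalized to
    [0, 1], the negative representation is the pixel-wise complement, and
    pn_classifier.predict takes a batch and returns class scores.
    """
    positive = x                  # original (positive) representation
    negative = 1.0 - x            # assumed negative representation
    label_pos = int(np.argmax(pn_classifier.predict(positive[None, ...])))
    label_neg = int(np.argmax(pn_classifier.predict(negative[None, ...])))
    return label_pos != label_neg  # disagreement => likely adversarial
```

Because the PNClassifier is trained on both representations, a clean example should receive the same label either way, so the disagreement test leaves clean-example accuracy untouched.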
