Abstract

White-box adversarial example (AE) attacks on deep neural networks (DNNs) are more destructive than black-box AE attacks. However, few studies have examined the generation of low-perturbation adversarial examples from an interpretability perspective: existing attacks offer little interpretation of the DNN's decision process, and the magnitude of the perturbation receives little further consideration. To address these gaps, we propose DI-AA, an interpretable white-box AE attack that not only applies the interpretability method of deep Taylor decomposition to select the most contributing features, but also adopts a Lagrangian relaxation optimization over the logit output and the Lp norm to make the perturbation less noticeable. We compare DI-AA with eight baseline attacks on four representative datasets. Experimental results show that our approach can (1) attack non-robust models with low perturbation, comparable to or lower than that of state-of-the-art white-box AE attacks; (2) evade adversarially trained robust models with the highest success rate; and (3) flexibly control the degree of AE saturation. Additionally, the AEs generated by DI-AA reduce the accuracy of robust black-box models by 16–31% when transferred in a black-box manner.
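To make the Lagrangian relaxation concrete, the following is a minimal sketch (not the authors' reference implementation) of an objective of the kind the abstract describes: the Lp norm of the perturbation is traded off against a logit-based attack loss via a multiplier `lam`. The mask `m`, restricting updates to the most contributing features, stands in for the deep-Taylor-decomposition selection step and is assumed to be precomputed; all function and parameter names here are illustrative.

```python
import torch

def di_aa_step(model, x, y, delta, m, lam=0.1, p=2, lr=0.01):
    """One gradient step on: ||m * delta||_p + lam * logit_loss (hypothetical sketch)."""
    delta = delta.clone().detach().requires_grad_(True)
    logits = model(torch.clamp(x + m * delta, 0, 1))
    true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    # Largest logit among the wrong classes.
    other = logits.clone()
    other.scatter_(1, y.unsqueeze(1), float("-inf"))
    best_other = other.max(dim=1).values
    # The attack succeeds once best_other exceeds true_logit; the hinge keeps
    # the logit term from dominating after success.
    logit_loss = torch.clamp(true_logit - best_other, min=0).sum()
    norm_loss = (m * delta).flatten(1).norm(p=p, dim=1).sum()
    loss = norm_loss + lam * logit_loss  # Lagrangian relaxation of the two goals
    loss.backward()
    with torch.no_grad():
        delta -= lr * delta.grad
    return delta.detach()
```

In such a formulation, `lam` plays the role of the Lagrange multiplier: larger values prioritize misclassification, smaller values prioritize a small, unnoticeable perturbation.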
