Powerful adversarial attacks against deep neural networks (DNNs) generate adversarial examples that mislead the DNN-based classifier by destroying the features of its last layer. To enhance the robustness of the classifier, this paper proposes a Feature Analysis and Conditional Matching prediction distribution (FACM) model that exploits the features of intermediate layers to correct misclassifications. We first prove that, under adversarial attack, the intermediate layers of the classifier still retain effective features of the original category, a result we define as the Correction Property. Based on this property, the FACM model consists of a Feature Analysis (FA) correction module, a Conditional Matching Prediction Distribution (CMPD) correction module, and a decision module. The FA correction module is composed of fully connected layers that take the features of the intermediate layers as input to correct the classifier's misclassifications. The CMPD correction module is a conditional autoencoder trained with a Kullback–Leibler loss to match the prediction distribution; using the intermediate-layer features as the condition not only accelerates convergence but also mitigates the negative effect of adversarial examples. Exploiting the empirically verified Diversity Property among the individual correction modules, the decision module integrates them to enhance the robustness of the DNN-based classifier by reducing the dimensionality of the adversarial subspace. That is, an input perturbed along directions (i.e., dimensions) that cause the classifier to misclassify can still be correctly classified by the proposed correction modules. Extensive experiments demonstrate that the FACM model outperforms existing methods against adversarial attacks, especially optimization-based white-box attacks and query-based black-box attacks.
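The sketch below illustrates, in PyTorch, one plausible organization of the components the abstract describes: an FA head built from fully connected layers over intermediate features, a conditional autoencoder for CMPD with a KL matching loss, and a simple decision rule combining the outputs. All layer sizes, module names, and the averaging decision rule are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of FACM-style correction modules (assumed architecture, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class FACorrection(nn.Module):
    """Fully connected correction head over one intermediate feature map."""

    def __init__(self, feat_dim: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Class logits predicted directly from the intermediate-layer features.
        return self.net(feat.flatten(1))


class CMPDCorrection(nn.Module):
    """Conditional autoencoder: intermediate features serve as the condition,
    and a latent head outputs a prediction distribution to be matched."""

    def __init__(self, in_dim: int, cond_dim: int, num_classes: int, latent: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim + cond_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent))
        self.decoder = nn.Sequential(nn.Linear(latent + cond_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))
        self.head = nn.Linear(latent, num_classes)

    def forward(self, x: torch.Tensor, cond: torch.Tensor):
        z = self.encoder(torch.cat([x.flatten(1), cond.flatten(1)], dim=1))
        recon = self.decoder(torch.cat([z, cond.flatten(1)], dim=1))
        return self.head(z), recon


def cmpd_loss(corr_logits, clf_logits, recon, x):
    """KL term matches the correction module's prediction distribution to the
    classifier's; the reconstruction term regularizes the autoencoder (assumed)."""
    kl = F.kl_div(F.log_softmax(corr_logits, dim=1),
                  F.softmax(clf_logits, dim=1), reduction="batchmean")
    return kl + F.mse_loss(recon, x.flatten(1))


def decision(logits_list):
    """Decision module (illustrative rule): average the softmax outputs of the
    classifier and all correction modules, then take the argmax."""
    probs = torch.stack([F.softmax(l, dim=1) for l in logits_list]).mean(0)
    return probs.argmax(dim=1)
```

Under this assumed design, an adversarial perturbation that flips the classifier's last-layer prediction must simultaneously defeat every correction head built on different intermediate features, which is how combining diverse modules shrinks the effective adversarial subspace.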