Despite achieving exceptional performance, deep neural networks (DNNs) remain vulnerable to adversarial examples, which are produced by corrupting clean examples with tiny perturbations. Many powerful defense methods, such as training data augmentation and input reconstruction, have been proposed, but they usually rely on prior knowledge of the targeted models or attacks. A clean example and its adversarial version are very similar but have different high-level representations in a victim model. If we can obtain a space in which the representations of similar examples are also similar, then adversarial examples can be identified by comparing the representations of input examples in this space with those in the high-level space of the victim model. Inspired by this, we propose a novel approach for detecting adversarial images, which can protect any pre-trained DNN classifier and resist a continual stream of new attacks. Specifically, we first adopt a dual autoencoder to project images into a latent space. The dual autoencoder uses self-supervised learning to ensure that small modifications to samples do not significantly alter their latent representations. Next, mutual information neural estimation is used to enhance the discriminability of the latent representations. We then leverage prior distribution matching to regularize the latent representations. To make the representations in the two spaces easy to compare without relying on prior knowledge of the targeted model, a simple fully connected neural network embeds the learned representations into an eigenspace that is consistent with the output eigenspace of the targeted model. By measuring the similarity of an input example's distributions in the two eigenspaces, we can judge whether the input example is adversarial.
Extensive experiments on MNIST, CIFAR-10, and ImageNet show that the proposed method achieves better defense performance and transferability than state-of-the-art methods.
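
The final detection step described above, comparing an input's distributions in the two eigenspaces, can be illustrated with a minimal sketch. The abstract does not name the similarity measure or the decision rule, so the Jensen-Shannon divergence, the fixed threshold, and the function names below are assumptions made for illustration only, not the authors' implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two distributions.

    Assumption: the abstract only says "distribution similarity";
    JS divergence is one plausible choice, used here for illustration.
    """
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(
            ai * math.log((ai + eps) / (bi + eps))
            for ai, bi in zip(a, b)
            if ai > 0
        )

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def is_adversarial(victim_logits, detector_logits, threshold=0.1):
    """Flag an input whose two output distributions disagree strongly.

    victim_logits:   output of the protected classifier (hypothetical input)
    detector_logits: output of the detector's embedding network (hypothetical)
    threshold:       hypothetical cut-off; in practice it would be tuned
                     on held-out clean data
    """
    p = softmax(victim_logits)
    q = softmax(detector_logits)
    return js_divergence(p, q) > threshold


# Clean input: both spaces agree on class 0.
clean_flag = is_adversarial([5.0, 0.0, 0.0], [4.5, 0.2, 0.1])
# Perturbed input: the victim model flips to class 1, the detector does not.
adv_flag = is_adversarial([0.0, 5.0, 0.0], [4.5, 0.2, 0.1])
```

In this sketch the detector never inspects the victim model's internals; it only needs the victim's output distribution, which mirrors the abstract's claim that the approach does not rely on prior knowledge of the targeted model.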