Interpreting Universal Adversarial Example Attacks on Image Classification Models
Mitigating adversarial deep learning attacks remains challenging, partly because such attacks are easy and inexpensive to carry out. In this paper, we therefore focus on understanding universal adversarial example attacks on image classification models. Specifically, we seek to understand the differences between adversarial examples in two adversarial datasets (DAmageNet and a PGD dataset) and clean examples in ImageNet, as learned by the classification model, and whether such findings can be used to resist adversarial example attacks. We also seek to determine whether we can retrain a discriminator, using adversarial training, to decide whether an input image is an adversarial example. We then design a number of experiments (e.g., class activation map (CAM) analysis, feature map analysis, changing feature maps/filters, adversarial training, and a binary classification model) to help us determine whether the universal adversarial dataset can be successfully used to attack the classification model. This, in turn, contributes to a better understanding of adversarial defenses for pretrained classification models from an interpretation perspective. To the best of our knowledge, this work is one of the earliest to systematically investigate the interpretation of universal adversarial example attacks on image classification models, both visually and quantitatively.
- Conference Article
- 10.1109/ftcs68006.2025.11405781
- Nov 21, 2025
Adversarial attacks introduce subtle perturbations to input images to mislead classification models into producing incorrect predictions. Training models on adversarial examples is a defense against such attacks. Conventional adversarial examples are generated by perturbing inputs along the gradient ascent direction. In this work, we aimed to determine whether adversarial examples generated along non-gradient ascent directions could improve a model's defensive capability. Therefore, we trained models using four CIFAR100-based datasets: the standard dataset, the standard adversarial example dataset, the dataset generated using perturbations in non-adversarial gradient directions, and the dataset with added random uniform noise. We analyzed the impact of adversarial examples on image classification models from two perspectives: classification accuracy and robustness. Training models on adversarial examples generated using perturbations in non-adversarial gradient directions not only defended against adversarial attacks but also improved the robustness of the models. Moreover, similar improvements can be achieved by adding random noise perturbations to the training data, without the large amount of computation required to generate adversarial examples.
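The contrast drawn above, between a perturbation along the gradient ascent direction and random uniform noise with the same budget, can be illustrated on a toy model. This is a minimal sketch only: the logistic-regression weights `w`, bias `b`, input `x`, and budget `eps` are hypothetical values chosen for illustration, and the signed-gradient step is the standard FGSM-style update rather than the exact procedure used in the paper.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy of a logistic model on one example."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def input_gradient(w, b, x, y):
    """Gradient of the loss with respect to the input x, which is (p - y) * w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def gradient_ascent_example(w, b, x, y, eps):
    """One signed-gradient step (FGSM-style) that increases the loss."""
    g = input_gradient(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def uniform_noise_example(x, eps, rng):
    """Random uniform perturbation with the same per-coordinate budget."""
    return [xi + rng.uniform(-eps, eps) for xi in x]

# Hypothetical toy model and input, not taken from the paper.
w, b = [2.0, -1.0, 0.5], 0.1
x, y = [0.5, 0.2, -0.3], 1
eps = 0.1

x_adv = gradient_ascent_example(w, b, x, y, eps)
x_rand = uniform_noise_example(x, eps, random.Random(0))

print("clean loss:      ", loss(w, b, x, y))
print("gradient-ascent: ", loss(w, b, x_adv, y))
print("uniform noise:   ", loss(w, b, x_rand, y))
```

For this linear model the signed-gradient step is the worst case inside the `eps` box, so the random perturbation can never produce a higher loss than it does; deep networks do not give that guarantee.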
- Research Article
- 10.1186/s42400-026-00553-y
- Mar 23, 2026
- Cybersecurity
With the rapid advancement of deep neural networks in wireless communications, applications such as signal modulation recognition and target detection face threats from adversarial example attacks. To enhance system robustness against adversarial attacks, adversarial example detection holds a unique position and role as a complementary approach to conventional adversarial defense methods. This paper investigates the spatial and frequency domain attribute differences between clean and adversarial signal examples, proposing a joint spatial-frequency domain adversarial example detection method for signal modulation recognition networks. In the frequency domain, we extract time-shifted autocorrelation features that capture the peak width differences between clean and adversarial examples, where adversarial perturbations exhibit wider autocorrelation peaks due to their signal-like energy distribution. In the spatial domain, we characterize the inter-layer feature propagation patterns through DNN layers by computing cosine similarities between layer-wise activations and class centers, revealing that adversarial examples exhibit progressive deviation from their true class in deeper layers. These complementary dual-domain features are then fused and classified through a Random Forest ensemble to achieve robust adversarial detection. Experimental results show that the proposed method achieves an adversarial detection rate of 90.32% with an AUC of 0.9475 under PGD attacks, substantially outperforming autoencoder-based and KL-divergence-based baseline detectors by 22.20% and 4.36% respectively. The detector also maintains robust performance across different attack types, achieving detection rates of 98.82% against FGSM and 99.36% against CW attacks. These results validate that the proposed method serves as an effective frontline defense to enhance the adversarial robustness of signal modulation recognition networks.
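The layer-wise feature described above, cosine similarity between a sample's activations and the class centers at each layer, can be sketched as follows. This is an illustrative sketch under stated assumptions: the activation vectors, labels, and helper names (`class_centers`, `layer_similarity_profile`) are hypothetical, and class centers are taken to be per-class mean activations, which the abstract does not specify.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def class_centers(activations, labels):
    """Mean activation vector per class for one layer (assumed definition)."""
    sums, counts = {}, {}
    for act, lab in zip(activations, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(act)
            counts[lab] = 0
        sums[lab] = [s + a for s, a in zip(sums[lab], act)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

def layer_similarity_profile(sample_acts, centers_per_layer, label):
    """Similarity of one sample to the given class center at every layer."""
    return [
        cosine_similarity(act, centers[label])
        for act, centers in zip(sample_acts, centers_per_layer)
    ]

# Hypothetical clean activations at two layers for a two-class problem.
labels = [0, 0, 1, 1]
layer1 = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
layer2 = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
centers_per_layer = [class_centers(layer1, labels), class_centers(layer2, labels)]

# A sample that matches class 0 early on but drifts toward class 1 later.
sample_acts = [[0.9, 0.1], [0.0, 1.0]]
profile = layer_similarity_profile(sample_acts, centers_per_layer, label=0)
print(profile)
```

A profile that falls with depth is the kind of pattern the abstract attributes to adversarial examples; in a detector, such profiles would be one group of input features for the classifier.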
- Book Chapter
- 10.1007/978-3-031-20865-2_30
- Jan 1, 2022
Adversarial Training (AT) is one of the most effective defense methods against adversarial examples, in which a model is trained on both clean and adversarial examples. Although AT improves robustness by smoothing the small neighborhood, it reduces accuracy on clean examples. We propose Weighted Adaptive Perturbation Adversarial Training (WAPAT) to reduce the loss of clean accuracy and improve robustness, which is motivated by the adaptive learning rate of the model optimizer. In the adversarial example generation stage of adversarial training, we introduce weights based on feature changes to adaptively adjust the perturbation step size for different features. In iterative attacks, if a feature is frequently attacked, we increase the attack strength in this area; otherwise, we weaken it. WAPAT is a data augmentation method that shortens the distance of adversarial examples to the classification boundary. The generated adversarial examples maintain good adversarial effects while retaining more information from the clean examples. Therefore, such adversarial examples can help us to obtain a more robust model while reducing the loss of recognition accuracy for clean examples. To demonstrate our method, we implement WAPAT in three adversarial training frameworks. Experimental results on CIFAR-10 and MNIST show that WAPAT significantly improves adversarial robustness with less sacrifice of accuracy.
Keywords: Adversarial examples; Adversarial training; Weighted perturbations
- Conference Article
42
- 10.1109/icassp40776.2020.9054750
- Apr 17, 2020
Machine learning systems are vulnerable to adversarial attacks and are highly likely to produce incorrect outputs under these attacks. Attacks are classed as white-box or black-box according to the adversary's level of access to the victim learning algorithm. To defend learning systems against these attacks, existing methods in the speech domain focus on modifying input signals and testing the behaviours of speech recognizers. We, however, formulate the defense as a classification problem and present a strategy for systematically generating adversarial example datasets: one for white-box attacks and one for black-box attacks, containing both adversarial and normal examples. The white-box attack is a gradient-based method on Baidu DeepSpeech with the Mozilla Common Voice database, while the black-box attack is a gradient-free method on a deep model-based keyword spotting system with the Google Speech Command dataset. The generated datasets are used to train a proposed Convolutional Neural Network (CNN), together with cepstral features, to detect adversarial examples. Experimental results show that it is possible to accurately distinguish between adversarial and normal examples for known attacks, in both single-condition and multi-condition training settings, while the performance degrades dramatically for unknown attacks. The adversarial datasets and the source code are made publicly available.
- Conference Article
1
- 10.1109/trustcom56396.2022.00134
- Dec 1, 2022
It has been demonstrated that deep neural networks can be easily fooled by adversarial examples. To improve the robustness of neural networks against adversarial attacks, substantial research on adversarial defenses is being carried out, of which input transformation is a typical category. However, because the transformation also affects the accuracy on clean examples, existing transformation-based defenses usually adopt minor transformations such as shift and scaling, which limits the defense effect of the transformation to some extent. To this end, we propose a method that uses dynamic and diverse transformations to defend against adversarial attacks. First, we construct a transformation pool that contains both minor and major transformations (e.g., flip, rotate). Second, we retrain the model with data transformed by major transformations to ensure that the performance of the model itself is not affected. Finally, we dynamically select transformations to preprocess the input of the model to defend against adversarial examples. We conducted extensive experiments on the MNIST and CIFAR-10 datasets and compared our method with state-of-the-art adversarial training and transformation-based defenses. The experimental results show that our proposed method outperforms the existing methods, greatly improving the robustness of the model against adversarial examples while maintaining high accuracy on clean examples. Our code is available at https://github.com/byerose/DynamicDiverseTransformations.
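The third step above, dynamically selecting transformations from a pool to preprocess each input, might look roughly like the following. This is a hedged sketch, not the authors' released code: the pool contents, the function names, and the choice of drawing `k` transformations uniformly at random are all assumptions made for illustration, and images are represented as plain nested lists.

```python
import random

def hflip(img):
    """Horizontal flip (a 'major' transformation in this sketch)."""
    return [row[::-1] for row in img]

def rotate90(img):
    """Rotate clockwise by 90 degrees."""
    return [list(row) for row in zip(*img[::-1])]

def shift_right(img):
    """Cyclic one-pixel shift to the right (a 'minor' transformation)."""
    return [row[-1:] + row[:-1] for row in img]

# Hypothetical transformation pool mixing minor and major transformations.
TRANSFORM_POOL = [hflip, rotate90, shift_right]

def dynamic_preprocess(img, rng, k=2):
    """Apply k transformations drawn at random from the pool, in order."""
    chosen = [rng.choice(TRANSFORM_POOL) for _ in range(k)]
    out = img
    for transform in chosen:
        out = transform(out)
    return out, [transform.__name__ for transform in chosen]

img = [[1, 2], [3, 4]]
out, applied = dynamic_preprocess(img, random.Random(0))
print(applied, out)
```

Because the selection changes per input, an attacker cannot tune a perturbation against one fixed preprocessing step; the model must in turn have been retrained on the major transformations so that clean accuracy survives them.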
- Research Article
27
- 10.1109/tpami.2020.3032061
- Oct 19, 2020
- IEEE Transactions on Pattern Analysis and Machine Intelligence
Although deep convolutional neural networks (CNNs) have demonstrated remarkable performance on multiple computer vision tasks, research on adversarial learning has shown that deep models are vulnerable to adversarial examples, which are crafted by adding visually imperceptible perturbations to the input images. Most existing adversarial attack methods create only a single adversarial example for the input, which gives just a glimpse of the underlying data manifold of adversarial examples. An attractive solution is to explore the solution space of the adversarial examples and generate a diverse set of them, which could potentially improve the robustness of real-world systems and help prevent severe security threats and vulnerabilities. In this paper, we present an effective method, called Hamiltonian Monte Carlo with Accumulated Momentum (HMCAM), aiming to generate a sequence of adversarial examples. To improve the efficiency of HMC, we propose a new regime to automatically control the length of trajectories, which allows the algorithm to move with adaptive step sizes along the search direction at different positions. Moreover, we revisit the reason for the high computational cost of adversarial training from the viewpoint of MCMC and design a new generative method called Contrastive Adversarial Training (CAT), which approaches the equilibrium distribution of adversarial examples with only a few iterations by building on small modifications of the standard Contrastive Divergence (CD) and achieves a trade-off between efficiency and accuracy. Both quantitative and qualitative analyses on several natural image datasets and practical systems have confirmed the superiority of the proposed algorithm.
- Supplementary Content
- 10.1184/r1/13607570.v1
- Jan 25, 2021
- Figshare
While deep networks have contributed to major leaps in raw performance across various applications, they are also known to be quite brittle to targeted data perturbations. By adding a small amount of adversarial noise to the data, it is possible to drastically change the output of a deep network. The existence of these so-called adversarial examples, perturbed data points which fool the model, poses a serious risk for safety- and security-centric applications where reliability and robustness are critical. In this dissertation, we present and analyze a number of approaches for mitigating the effect of adversarial examples, also known as adversarial defenses. These defenses can offer varying degrees and types of robustness, and in this dissertation we study defenses which differ in the strength of the robustness guarantee, the efficiency and simplicity of the defense, and the type of perturbation being defended against. We start with the strongest type of guarantee, called provable adversarial defenses, showing that it is possible to compute duality-based certificates that guarantee no adversarial examples exist within an $\ell_p$-bounded region, which are trainable and can be minimized to learn networks which are provably robust to adversarial attacks. The approach is agnostic to the specific architecture and is applicable to arbitrary computational graphs, scaling to medium-sized convolutional networks with random projections. We then switch gears to developing a deeper understanding of a more empirical defense known as adversarial training. Although adversarial training does not come with formal guarantees, it can learn networks more efficiently and with better empirical performance against attacks.
We study the optimization process and reveal several intriguing properties of the robust learning problem, finding that a simple modification to one of the earliest adversarial attacks can be sufficient to learn networks robust to much stronger attacks, as well as finding that adversarial training as a general procedure is highly susceptible to overfitting. These discoveries have significant implications for both the efficiency of adversarial training and the state of the field: for example, virtually all recent algorithmic improvements in adversarial training can be matched by simply using early stopping. The final component of this dissertation expands the realm of adversarial examples beyond $\ell_p$-norm bounded perturbations, to enable more realistic threat models for applications beyond imperceptible noise. We define a threat model called the Wasserstein adversarial example, which captures semantically meaningful image transformations like translations and rotations previously uncaptured by existing threat models. We present an efficient algorithm for projecting onto Wasserstein balls, enabling both generation of and adversarial training against Wasserstein adversarial examples. Finally, we demonstrate how to generalize adversarial training to defend against multiple types of threats simultaneously, improving upon naive aggregations of adversarial attacks.
- Research Article
26
- 10.3390/app10228079
- Nov 14, 2020
- Applied Sciences
State-of-the-art neural network models are actively used in various fields, but it is well known that they are vulnerable to adversarial example attacks. Making models robust against such attacks has proved to be a very difficult task. While many defense approaches have been shown to be ineffective, adversarial training remains one of the promising methods. In adversarial training, the training data are augmented by “adversarial” samples generated using an attack algorithm. If the attacker uses a similar attack algorithm to generate adversarial examples, the adversarially trained network can be quite robust to the attack. However, there are numerous ways of creating adversarial examples, and the defender does not know what algorithm the attacker may use. A natural question is: can we use adversarial training to train a model that is robust to multiple types of attack? Previous work has shown that, when a network is trained with adversarial examples generated from multiple attack methods, the network is still vulnerable to white-box attacks where the attacker has complete access to the model parameters. In this paper, we study this question in the context of black-box attacks, which can be a more realistic assumption for practical applications. Experiments with the MNIST dataset show that adversarially training a network with an attack method helps defend against that particular attack method, but has limited effect for other attack methods. In addition, even if the defender trains a network with multiple types of adversarial examples and the attacker attacks with one of the methods, the network could lose accuracy to the attack if the attacker uses a different data augmentation strategy on the target network. These results show that it is very difficult to make a robust network using adversarial training, even for black-box settings where the attacker has restricted information on the target network.
- Conference Article
5
- 10.1109/icip42928.2021.9506383
- Sep 19, 2021
Recent research has shown that deep neural networks (DNNs) are vulnerable to adversarial examples. Adversarial training is, in practice, the most effective approach to improve the robustness of DNNs against adversarial examples. However, conventional adversarial training methods focus only on the classification results or the instance-level relationship on feature representations for adversarial examples. Inspired by the fact that adversarial examples break the distinguishability of the feature representations of DNNs for different classes, we propose Intra and Inter Class Feature Regularization ($\mathrm{I}^{2}\mathrm{FR}$) to make the feature distribution of adversarial examples maintain the same classification property as clean examples. On the one hand, the intra-class regularization restricts the distance of features between adversarial examples and both the corresponding clean data and samples of the same class. On the other hand, the inter-class regularization prevents the features of adversarial examples from getting close to other classes. By adding $\mathrm{I}^{2}\mathrm{FR}$ in both the adversarial example generation and model training steps of adversarial training, we can get stronger and more diverse adversarial examples, and the neural network learns a more distinguishable and reasonable feature distribution. Experiments on various adversarial training frameworks demonstrate that $\mathrm{I}^{2}\mathrm{FR}$ adapts to multiple training frameworks and outperforms the state-of-the-art methods for classification of both clean data and adversarial examples.
- Conference Article
26
- 10.1109/isvlsi.2018.00092
- Jul 1, 2018
Some recent works revealed that deep neural networks (DNNs) are vulnerable to so-called adversarial attacks, where input examples are intentionally perturbed to fool DNNs. In this work, we revisit adversarial training, the DNN training process that includes adversarial examples in the training dataset so as to improve the DNN's resilience to adversarial attacks. Our experiments show that different adversarial strengths, i.e., perturbation levels of adversarial examples, have different working zones to resist the attack. Based on this observation, we propose a multi-strength adversarial training method (MAT) that combines adversarial training examples with different adversarial strengths to defend against adversarial attacks. Two training structures - mixed MAT and parallel MAT - are developed to facilitate the tradeoffs between training time and memory occupation. Our results show that MAT can substantially reduce the accuracy degradation of deep learning systems under adversarial attacks on MNIST, CIFAR-10, CIFAR-100, and SVHN.
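The idea of combining adversarial examples of several strengths in one training set can be sketched as below. This is only a rough illustration under stated assumptions: the helper names, the linear scorer `W`, the loss $-y\,(w \cdot x)$, and the strength values are hypothetical, and the sketch shows just the data-mixing step, not the paper's mixed or parallel training structures.

```python
def signed_step(x, grad, eps):
    """Move each coordinate by eps in the direction of the gradient's sign."""
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

def mixed_strength_batch(examples, grad_fn, strengths):
    """Return each clean example plus one adversarial copy per strength.

    Every entry is (input, label, eps), with eps = 0.0 for clean inputs.
    """
    batch = []
    for x, y in examples:
        batch.append((x, y, 0.0))
        grad = grad_fn(x, y)
        for eps in strengths:
            batch.append((signed_step(x, grad, eps), y, eps))
    return batch

# Hypothetical linear scorer with labels in {-1, +1}; for the loss
# -y * (w . x), the gradient with respect to x is -y * w.
W = [1.0, -2.0]

def toy_grad(x, y):
    return [-y * wi for wi in W]

examples = [([0.5, 0.5], 1), ([-0.3, 0.8], -1)]
batch = mixed_strength_batch(examples, toy_grad, strengths=[0.05, 0.1, 0.2])
for x, y, eps in batch:
    print(eps, x, y)
```

Training on such a batch exposes the model to several perturbation levels at once, at the cost of a training set that grows linearly with the number of strengths.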
- Research Article
56
- 10.1016/j.media.2021.101977
- Jan 22, 2021
- Medical Image Analysis
Towards evaluating the robustness of deep diagnostic models by adversarial attack.
- Research Article
6
- 10.1109/tcss.2023.3291565
- Dec 1, 2024
- IEEE Transactions on Computational Social Systems
Face recognition (FR) models can be easily fooled by adversarial examples, which are crafted by adding imperceptible perturbations on benign face images. The existence of adversarial face examples poses a great threat to the security of society. To build a more sustainable digital nation, in this article, we improve the transferability of adversarial face examples to expose more blind spots of the existing FR models. Though generating hard samples has shown its effectiveness in improving the generalization of models in training tasks, the effectiveness of using this idea to improve the transferability of adversarial face examples remains unexplored. To this end, based on the property of hard samples and the symmetry between training tasks and adversarial attack tasks, we propose the concept of hard models, which have similar effects as hard samples for adversarial attack tasks. Using the concept of hard models, we propose a novel attack method called beneficial perturbation feature augmentation attack (BPFA), which reduces the overfitting of adversarial examples to surrogate FR models by constantly generating new hard models to craft the adversarial examples. Specifically, in the backpropagation, BPFA records the gradients on preselected feature maps and uses the gradient on the input image to craft the adversarial example. In the next forward propagation, BPFA leverages the recorded gradients to add beneficial perturbations on their corresponding feature maps to increase the loss. Extensive experiments demonstrate that BPFA can significantly boost the transferability of adversarial attacks on FR.
- Research Article
1
- 10.1117/1.jei.32.2.023023
- Mar 28, 2023
- Journal of Electronic Imaging
Adversarial example generation (AEG) has been a hot topic in recent years because it can cause deep neural networks (DNNs) to misclassify the generated adversarial examples, which reveals the vulnerability of DNNs and motivates us to find good solutions to improve the robustness of DNN models. Because natural language is so widespread and fluid on social networks, various natural language-based adversarial attack algorithms have been proposed in the literature. These algorithms generate adversarial text examples with high semantic quality. However, the generated adversarial text examples and the corresponding attack models may be maliciously or illegally used. To tackle this problem, we present a general framework encapsulated in cloud application programming interfaces (APIs) for generating watermarked adversarial text examples to protect adversarial text examples and the corresponding adversarial text attack models. For each word in a given text, a set of candidate words is determined to ensure that all the words in the set can be used to carry secret bits or facilitate the construction of the adversarial example. By applying a word-level adversarial text generation algorithm, the watermarked adversarial text example can finally be generated. Experimental results show that the adversarial text examples generated by the proposed method not only successfully fool advanced DNN models, but also carry watermarks that can effectively verify the ownership and trace the source of the adversarial examples and the corresponding attack models. Moreover, the watermark can still survive after being attacked with AEG algorithms, which shows the applicability and superiority of the method.
- Research Article
3
- 10.1109/tcad.2020.2969982
- Jan 31, 2020
- IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Deep neural networks (DNNs) have shown phenomenal success in many real-world applications. However, a concerning weakness of DNNs is their vulnerability to adversarial attacks. Although there exist some methods to detect adversarial attacks, they often suffer from high computational cost and constraints on certain types of attacks, and ignore external features that could aid attack detection. In this article, we propose the fast confidence detection method (FCDM), an innovative method for fast confidence detection of adversarial attacks based on measuring the integrity of the sensor pattern noise fingerprint embedded in input examples. We note that existing adversarial detectors are often designed as a binary classifier to differentiate clean from adversarial examples. However, the detection of adversarial examples can be much more complicated than such a scenario. Our key insight is that the confidence level of detecting an input sample as an adversarial example is more useful information for the system to take proper action to resist potential attacks. The experimental results show that FCDM is capable of giving a confidence distribution model of the most popular adversarial attacks. Using the confidence distribution model, FCDM can quickly determine the confidence level of the input sample. Based on different properties of the confidence distribution models associated with these adversarial attacks, FCDM can provide early attack warning, including even the possible attack types of the adversarial attack examples. FCDM also has the following advantages: 1) it is effective for both white-box and black-box attacks; 2) it does not depend on the class of adversarial attacks and can be used as both a known attack defense and an unknown attack defense; and 3) it does not need to know the details of the DNN model and does not affect the functionality of the DNN.
Since the fast confidence detection method (FCDM) is a computationally heavy task, we propose an FPGA-based accelerator based on a series of optimization techniques, such as quantization, data reuse, and operation replacement. We implement our method on an FPGA platform and achieve a system clock frequency of 279 MHz with a power consumption of only 0.7626 W. Moreover, in the real system performance test, we obtain a high efficiency of 29.740 IPS/W and a low latency of just 44.1 ms with very marginal accuracy loss.
- Research Article
9
- 10.1109/tpami.2024.3411035
- Dec 1, 2024
- IEEE transactions on pattern analysis and machine intelligence
Adversarial attacks have been proven to be potential threats to Deep Neural Networks (DNNs), and many methods have been proposed to defend against them. However, while robustness is enhanced, the accuracy on clean examples declines to a certain extent, implying that a trade-off exists between accuracy and adversarial robustness. In this paper, to address this trade-off, we theoretically explore the underlying reason for the difference in the filters' weight distribution between standard-trained and robust-trained models, and then argue that this is an intrinsic property of static neural networks, so it is difficult for them to fundamentally improve accuracy and adversarial robustness at the same time. Based on this analysis, we propose a sample-wise dynamic network architecture named Adversarial Weight-Varied Network (AW-Net), which focuses on dealing with clean and adversarial examples with a "divide and rule" weight strategy. The AW-Net adaptively adjusts the network's weights based on regulation signals generated by an adversarial router, which is directly influenced by the input sample. Benefiting from the dynamic network architecture, clean and adversarial examples can be processed with different network weights, which provides the potential to enhance both accuracy and adversarial robustness. A series of experiments demonstrates that our AW-Net is architecture-friendly in handling both clean and adversarial examples and can achieve better trade-off performance than state-of-the-art robust models.