Abstract

While recent efforts have shown that neural text processing models are vulnerable to adversarial examples, comparatively little attention has been paid to explicitly characterizing their effectiveness. To address this, we present analytical insights into the word frequency characteristics of word-level adversarial examples for neural text classification models. We show that adversarial attacks against CNN-, LSTM- and Transformer-based classification models perform token substitutions that are identifiable through word frequency differences between replaced words and their substitutions. Based on these findings, we propose frequency-guided word substitutions (FGWS) as a simple algorithm for the automatic detection of adversarially perturbed textual sequences. FGWS exploits the word frequency properties of adversarial word substitutions, and we assess its suitability for the automatic detection of adversarial examples generated from the SST-2 and IMDb sentiment datasets. Our method provides promising results by accurately detecting adversarial examples, with F1 detection scores of up to 93.7% against RoBERTa-based classification models. We compare our approach against baseline detection approaches as well as a recently proposed perturbation discrimination framework, and show that we outperform existing approaches by up to 15.1% F1 in our experiments.
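The F1 detection scores quoted above treat detection as a binary classification over input sequences. As a minimal reference for how such a score is computed (this is the standard F1 definition, nothing specific to this paper; the function name and input format are illustrative):

```python
def detection_f1(flags, labels):
    """F1 score for adversarial-example detection.

    flags  -- list of booleans: True if the detector flagged the input
    labels -- list of booleans: True if the input really was adversarial
    """
    tp = sum(f and l for f, l in zip(flags, labels))          # true positives
    fp = sum(f and not l for f, l in zip(flags, labels))      # false positives
    fn = sum(l and not f for f, l in zip(flags, labels))      # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```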

Highlights

  • Artificial neural networks are vulnerable to adversarial examples—carefully crafted perturbations of input data that lead a learning model into making false predictions (Szegedy et al., 2014). While initially discovered for computer vision tasks, natural language processing (NLP) models have been shown to be oversensitive to adversarial input perturbations for a variety of tasks (Papernot et al., 2016; Jia and Liang, 2017; Belinkov and Bisk, 2018; Glockner et al., 2018; Iyyer et al., 2018)

  • We show that we can achieve improved performance for the detection and correction of adversarial examples, based on the finding that various word-level adversarial attacks tend to replace input words with less frequent ones

  • We propose frequency-guided word substitutions (FGWS), a detection method that estimates whether a given input sequence is an adversarial example; a sketch of the idea follows below
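As a hedged illustration of how such a frequency-guided check could work (a minimal sketch, not the paper's exact implementation), the snippet below swaps suspiciously infrequent words for more frequent synonyms and flags the input if the model's confidence drops. The `freq` mapping, `synonyms` lookup, `predict_proba` callable, and the `delta`/`gamma` thresholds are all illustrative assumptions; in practice the frequencies would plausibly be estimated from the attacked model's training corpus and the thresholds tuned on held-out data.

```python
def fgws_detect(tokens, freq, synonyms, predict_proba, delta=10, gamma=0.5):
    """Flag `tokens` as adversarial if swapping its rare words for more
    frequent synonyms lowers the model's confidence in its original prediction.

    freq          -- word -> corpus frequency (e.g. a collections.Counter)
    synonyms      -- word -> list of candidate replacement words
    predict_proba -- callable: token list -> list of class probabilities
    delta, gamma  -- illustrative frequency / confidence-drop thresholds
    """
    transformed = list(tokens)
    for i, word in enumerate(tokens):
        if freq.get(word, 0) < delta:  # word is suspiciously infrequent
            # most frequent synonym that is more frequent than the original word
            candidates = [s for s in synonyms.get(word, ())
                          if freq.get(s, 0) > freq.get(word, 0)]
            if candidates:
                transformed[i] = max(candidates, key=lambda s: freq.get(s, 0))
    probs = predict_proba(tokens)
    label = max(range(len(probs)), key=probs.__getitem__)  # originally predicted class
    confidence_drop = probs[label] - predict_proba(transformed)[label]
    return confidence_drop > gamma, transformed
```

If the model's confidence in its original prediction falls sharply once rare words are replaced with frequent synonyms, the sequence is flagged as likely adversarial.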

Summary

Introduction

Artificial neural networks are vulnerable to adversarial examples—carefully crafted perturbations of input data that lead a learning model into making false predictions (Szegedy et al., 2014). We focus on highly successful synonym substitution attacks (Alzantot et al., 2018; Ren et al., 2019; Zang et al., 2020), in which individual words are replaced with semantically similar ones. Existing defense methods against these attacks mainly focus on adversarial training. Recent work by Zhou et al. (2019) instead proposes DISP (learning to discriminate perturbations), a perturbation discrimination framework that exploits pre-trained contextualized word representations to detect and correct word-level adversarial substitutions without having to retrain the attacked model. We show that we can achieve improved performance for the detection and correction of adversarial examples based on the finding that various word-level adversarial attacks have a tendency to replace input words with less frequent ones. We provide statistical evidence to support this observation (a sketch of the analysis follows below) and propose a rule-based and model-agnostic algorithm, frequency-guided word substitutions (FGWS), to detect adversarial sequences
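As a sketch of the kind of frequency analysis this finding rests on (the paper's actual statistical evidence, e.g. the Bayes factors listed in its appendix, is more involved), the snippet below computes log-frequency gaps between replaced words and their adversarial substitutions. The `pairs` input format and the add-one smoothing are assumptions for illustration.

```python
import math

def substitution_frequency_gaps(pairs, freq):
    """Log-frequency gap between each replaced word and its substitution.

    pairs -- iterable of (original_word, substituted_word) produced by an attack
    freq  -- word -> corpus frequency; add-one smoothing guards against
             zero counts for unseen words
    """
    return [
        math.log(freq.get(substitution, 0) + 1) - math.log(freq.get(original, 0) + 1)
        for original, substitution in pairs
    ]
```

Predominantly negative gaps would reflect the stated tendency: the attacks systematically substitute words that are rarer in the corpus than the words they replace.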

Generating adversarial examples
Analyzing frequencies of adversarial word substitutions
Frequency-guided word substitutions
Comparisons
DISP
FGWS
Experiments
Results
FGWS on unperturbed data
Limitations
Conclusion
A Dataset statistics
RoBERTa
GENETIC
C Frequency differences for CNN and LSTM models
D Bayes factors
F Varying false positive thresholds
G Additional FGWS examples