Abstract

While recent efforts have shown that neural text processing models are vulnerable to adversarial examples, comparatively little attention has been paid to explicitly characterizing their effectiveness. To address this, we present analytical insights into the word frequency characteristics of word-level adversarial examples for neural text classification models. We show that adversarial attacks against CNN-, LSTM- and Transformer-based classification models perform token substitutions that are identifiable through word frequency differences between replaced words and their substitutions. Based on these findings, we propose frequency-guided word substitutions (FGWS) as a simple algorithm for the automatic detection of adversarially perturbed textual sequences. FGWS exploits the word frequency properties of adversarial word substitutions, and we assess its suitability for the automatic detection of adversarial examples generated from the SST-2 and IMDb sentiment datasets. Our method provides promising results by accurately detecting adversarial examples, with F1 detection scores of up to 93.7% against RoBERTa-based classification models. We compare our approach against baseline detection approaches as well as a recently proposed perturbation discrimination framework, and show that we outperform existing approaches by up to 15.1% F1 in our experiments.
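The F1 detection scores quoted above treat detection as a binary classification over input sequences. As a minimal reference for how such a score is computed (this is the standard F1 definition, nothing specific to this paper; the function name and input format are illustrative):

```python
def detection_f1(flags, labels):
    """F1 score for adversarial-example detection.

    flags  -- list of booleans: True if the detector flagged the input
    labels -- list of booleans: True if the input really was adversarial
    """
    tp = sum(f and l for f, l in zip(flags, labels))          # true positives
    fp = sum(f and not l for f, l in zip(flags, labels))      # false positives
    fn = sum(l and not f for f, l in zip(flags, labels))      # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```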

Highlights

  • Artificial neural networks are vulnerable to adversarial examples—carefully crafted perturbations of input data that lead a learning model into making false predictions (Szegedy et al., 2014). While initially discovered for computer vision tasks, natural language processing (NLP) models have been shown to be oversensitive to adversarial input perturbations for a variety of tasks (Papernot et al., 2016; Jia and Liang, 2017; Belinkov and Bisk, 2018; Glockner et al., 2018; Iyyer et al., 2018)

  • We show that we can achieve improved performance for the detection and correction of adversarial examples, based on the finding that various word-level adversarial attacks tend to replace input words with less frequent ones

  • We propose frequency-guided word substitutions (FGWS), a detection method that estimates whether a given input sequence is an adversarial example; a sketch of the idea follows below
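As a hedged illustration of how such a frequency-guided check could work (a minimal sketch, not the paper's exact implementation), the snippet below swaps suspiciously infrequent words for more frequent synonyms and flags the input if the model's confidence drops. The `freq` mapping, `synonyms` lookup, `predict_proba` callable, and the `delta`/`gamma` thresholds are all illustrative assumptions; in practice the frequencies would plausibly be estimated from the attacked model's training corpus and the thresholds tuned on held-out data.

```python
def fgws_detect(tokens, freq, synonyms, predict_proba, delta=10, gamma=0.5):
    """Flag `tokens` as adversarial if swapping its rare words for more
    frequent synonyms lowers the model's confidence in its original prediction.

    freq          -- word -> corpus frequency (e.g. a collections.Counter)
    synonyms      -- word -> list of candidate replacement words
    predict_proba -- callable: token list -> list of class probabilities
    delta, gamma  -- illustrative frequency / confidence-drop thresholds
    """
    transformed = list(tokens)
    for i, word in enumerate(tokens):
        if freq.get(word, 0) < delta:  # word is suspiciously infrequent
            # most frequent synonym that is more frequent than the original word
            candidates = [s for s in synonyms.get(word, ())
                          if freq.get(s, 0) > freq.get(word, 0)]
            if candidates:
                transformed[i] = max(candidates, key=lambda s: freq.get(s, 0))
    probs = predict_proba(tokens)
    label = max(range(len(probs)), key=probs.__getitem__)  # originally predicted class
    confidence_drop = probs[label] - predict_proba(transformed)[label]
    return confidence_drop > gamma, transformed
```

If the model's confidence in its original prediction falls sharply once rare words are replaced with frequent synonyms, the sequence is flagged as likely adversarial.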

Summary

Introduction

Artificial neural networks are vulnerable to adversarial examples—carefully crafted perturbations of input data that lead a learning model into making false predictions (Szegedy et al., 2014). We focus on highly successful synonym substitution attacks (Alzantot et al., 2018; Ren et al., 2019; Zang et al., 2020), in which individual words are replaced with semantically similar ones. Existing defense methods against these attacks mainly focus on adversarial training. Recent work by Zhou et al. (2019) instead proposes DISP (learning to discriminate perturbations), a perturbation discrimination framework that exploits pre-trained contextualized word representations to detect and correct word-level adversarial substitutions without having to retrain the attacked model. We show that we can achieve improved performance for the detection and correction of adversarial examples based on the finding that various word-level adversarial attacks have a tendency to replace input words with less frequent ones. We provide statistical evidence to support this observation (a sketch of the analysis follows below) and propose a rule-based and model-agnostic algorithm, frequency-guided word substitutions (FGWS), to detect adversarial sequences
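As a sketch of the kind of frequency analysis this finding rests on (the paper's actual statistical evidence, e.g. the Bayes factors listed in its appendix, is more involved), the snippet below computes log-frequency gaps between replaced words and their adversarial substitutions. The `pairs` input format and the add-one smoothing are assumptions for illustration.

```python
import math

def substitution_frequency_gaps(pairs, freq):
    """Log-frequency gap between each replaced word and its substitution.

    pairs -- iterable of (original_word, substituted_word) produced by an attack
    freq  -- word -> corpus frequency; add-one smoothing guards against
             zero counts for unseen words
    """
    return [
        math.log(freq.get(substitution, 0) + 1) - math.log(freq.get(original, 0) + 1)
        for original, substitution in pairs
    ]
```

Predominantly negative gaps would reflect the stated tendency: the attacks systematically substitute words that are rarer in the corpus than the words they replace.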

Generating adversarial examples
Analyzing frequencies of adversarial word substitutions
Frequency-guided word substitutions
Comparisons
DISP
FGWS
Experiments
Results
FGWS on unperturbed data
Limitations
Conclusion
A Dataset statistics
RoBERTa
GENETIC
C Frequency differences for CNN and LSTM models
D Bayes factors
F Varying false positive thresholds
G Additional FGWS examples