Abstract

Building effective adversarial attackers and elaborating countermeasures against adversarial attacks in natural language processing (NLP) have attracted a lot of research in recent years. However, most existing approaches focus on classification problems. In this paper, we investigate attacks and defenses for structured prediction tasks in NLP. Besides the difficulty of perturbing discrete words and the sentence-fluency problem faced by attackers in any NLP task, attackers of structured prediction models face a specific challenge: the structured output is sensitive to small perturbations in the input. To address these problems, we propose a novel and unified framework that learns to attack a structured prediction model using a sequence-to-sequence model with feedback from multiple reference models of the same structured prediction task. Based on the proposed attack, we further reinforce the victim model with adversarial training, making its predictions more robust and accurate. We evaluate the proposed framework on dependency parsing and part-of-speech tagging. Automatic and human evaluations show that our proposed framework succeeds both in attacking state-of-the-art structured prediction models and in boosting them with adversarial training.

Highlights

  • Adversarial examples, which contain perturbations to the input of a model that elicit large changes in the output, have been shown to be an effective way of assessing the robustness of models in natural language processing (NLP) (Jia and Liang, 2017; Belinkov and Bisk, 2018; Hosseini et al., 2017; Samanta and Mehta, 2017; Alzantot et al., 2018; Ebrahimi et al., 2018; Michel et al., 2019; Wang et al., 2019)

  • We propose a novel framework to attack structured prediction models in natural language processing (NLP)

  • Our framework consists of a structured-output evaluation criterion based on reference models and a seq2seq sentence generator (a minimal sketch of the evaluation criterion follows this list)
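
To make the last highlight more concrete, the sketch below illustrates one plausible form such a reference-model evaluation criterion could take. It assumes parsers are callables that map a tokenized sentence to a list of dependency-head indices and that agreement is measured with an unlabeled-attachment-style score; all names here (attachment_score, evaluate_candidate, victim, ref_parsers) are illustrative and hypothetical, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a reference-model evaluation
# criterion for a candidate adversarial sentence.
# Assumption: a "parser" is a callable mapping a token list to head indices.

def attachment_score(heads_a, heads_b):
    """Fraction of tokens whose predicted heads agree in the two parses."""
    matches = sum(a == b for a, b in zip(heads_a, heads_b))
    return matches / max(len(heads_a), 1)

def evaluate_candidate(victim, ref_parsers, adv_sentence):
    """Score a candidate adversarial sentence from the seq2seq generator.

    Since the gold parse of a perturbed sentence is unknown, the reference
    parsers' predictions act as a proxy: a good adversarial example is one on
    which the references agree with each other (the sentence is still
    consistently parseable) while the victim deviates from them.
    """
    ref_trees = [parser(adv_sentence) for parser in ref_parsers]
    victim_tree = victim(adv_sentence)

    # Pairwise agreement among reference parsers (proxy for output validity).
    pairs = [(i, j) for i in range(len(ref_trees))
             for j in range(i + 1, len(ref_trees))]
    ref_agreement = sum(attachment_score(ref_trees[i], ref_trees[j])
                        for i, j in pairs) / max(len(pairs), 1)

    # Agreement between victim and references (lower means a stronger attack).
    victim_vs_ref = sum(attachment_score(victim_tree, t)
                        for t in ref_trees) / len(ref_trees)

    # Higher is better for the attacker: references agree, victim deviates.
    return ref_agreement - victim_vs_ref

# Toy usage with stand-in "parsers" that return fixed head lists.
victim = lambda sent: [2, 0, 2, 3]   # victim's parse of the candidate
ref_a  = lambda sent: [2, 0, 4, 2]   # reference parser 1
ref_b  = lambda sent: [2, 0, 4, 2]   # reference parser 2
print(evaluate_candidate(victim, [ref_a, ref_b], ["They", "saw", "the", "dog"]))
```

In this reading, the criterion rewards candidates that remain consistently parseable by the reference models while changing the victim's output, which is exactly the structured-prediction-specific sensitivity the abstract identifies.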


Summary

Introduction

Adversarial examples, which contain perturbations to the input of a model that elicit large changes in the output, have been shown to be an effective way of assessing the robustness of models in natural language processing (NLP) (Jia and Liang, 2017; Belinkov and Bisk, 2018; Hosseini et al., 2017; Samanta and Mehta, 2017; Alzantot et al., 2018; Ebrahimi et al., 2018; Michel et al., 2019; Wang et al., 2019). Structured prediction in NLP aims to predict output variables that are mutually dependent or constrained given an input sentence. A structured prediction model predicts the output y given an input sentence x by maximizing the log conditional probability: ŷ = arg max_y log P(y | x; Θ).

[Figure: attacking a parser (Parser A) on an input x = x1 x2 x3 x4 by finding a perturbation r that minimizes log P(y*|x + r).]
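
For a concrete handle on the notation, the following minimal sketch spells out the decision rule arg max_y log P(y | x; Θ) and the attacker's complementary goal of driving down log P(y* | x + r) over candidate perturbed inputs, where y* is the prediction on the clean input. The toy tagger-style model whose probability factorizes over tokens, and all function names, are illustrative assumptions rather than the paper's actual victim models.

```python
import math

# Illustrative sketch only: a toy structured predictor whose distribution
# factorizes over tokens (tagger-style), used to spell out
#   prediction:  y_hat = arg max_y log P(y | x; Theta)
#   attack goal: choose a perturbed input x + r that minimizes log P(y* | x + r).
# The real victim models (dependency parsers, POS taggers) are far richer;
# the helper names here are hypothetical.

def log_prob(model, sentence, tags):
    """log P(tags | sentence) for a model giving per-token tag distributions."""
    return sum(math.log(model(sentence, i)[tag]) for i, tag in enumerate(tags))

def predict(model, sentence, tagset):
    """arg max_y log P(y | x): with a factorized model, argmax per token."""
    return [max(tagset, key=lambda t: model(sentence, i)[t])
            for i in range(len(sentence))]

def best_attack(model, sentence, candidates, tagset):
    """Pick the word-substitution candidate (same length as x) that most
    reduces log P(y* | x + r), i.e. most disturbs the original prediction."""
    y_star = predict(model, sentence, tagset)
    return min(candidates, key=lambda cand: log_prob(model, cand, y_star))

# Toy usage: a hand-crafted two-tag model, just to exercise the functions.
TAGSET = ["NOUN", "VERB"]
def toy_model(sentence, i):
    # Pretend "fire" looks verb-like and every other word looks noun-like.
    return {"NOUN": 0.2, "VERB": 0.8} if sentence[i] == "fire" else {"NOUN": 0.9, "VERB": 0.1}

x = ["the", "crew", "light", "lamps"]
candidates = [["the", "crew", "fire", "lamps"], ["the", "team", "light", "lamps"]]
print(predict(toy_model, x, TAGSET))                  # y* on the clean input
print(best_attack(toy_model, x, candidates, TAGSET))  # candidate lowering log P(y*|x+r)
```

In the paper's actual setting the output y is a full dependency tree or tag sequence and the perturbation is produced by a learned seq2seq attacker guided by reference models, but the objective has the same shape as this sketch.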
