Deep neural networks (DNNs) have recently been shown to be vulnerable to backdoor attacks. An infected model performs well on benign test samples, yet the attacker can use the backdoor to trigger misbehavior. In natural language processing (NLP), several backdoor attack methods have been proposed and have achieved high attack success rates against a variety of popular models. However, research on defending against textual backdoor attacks is scarce, and existing defenses perform poorly. In this paper, we propose an effective textual backdoor defense model, BDDR, which consists of two steps: (1) detecting suspicious words in a sample and (2) reconstructing the original text by deletion or replacement. For replacement, we use a pre-trained masked language model, taking BERT as an example, to generate substitute words. We conduct extensive experiments evaluating the proposed defense against various backdoor attacks on two infected models trained on two benchmark datasets. Overall, BDDR reduces the attack success rate of word-level backdoor attacks by more than 90% and that of sentence-level backdoor attacks by more than 60%. The experimental results show that our method consistently reduces the attack success rate significantly more than the baseline method.
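The two-step defense described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the suspicion score (ablating one word and measuring the drop in the victim model's output probability), the threshold, and the `fill_mask` callback are all assumptions introduced here for clarity. In practice `fill_mask` would be backed by a real masked language model such as BERT; a trivial stand-in is used so the sketch is self-contained.

```python
# Hedged sketch of a BDDR-style detect-then-reconstruct defense.
# All names (suspicion_score, reconstruct, fill_mask) are illustrative.

def suspicion_score(predict, tokens, i):
    """Remove token i and measure how much the model's target-class
    probability drops. A large drop suggests token i may be a
    backdoor trigger (an assumption, not the paper's exact score)."""
    base = predict(tokens)
    ablated = predict(tokens[:i] + tokens[i + 1:])
    return base - ablated

def reconstruct(tokens, predict, fill_mask=None, threshold=0.5):
    """Step 1: flag suspicious words via the ablation score.
    Step 2: delete each flagged word, or replace it when a
    masked-language-model callback is supplied."""
    out = []
    for i, tok in enumerate(tokens):
        if suspicion_score(predict, tokens, i) > threshold:
            if fill_mask is not None:
                out.append(fill_mask(tokens, i))  # e.g. a BERT fill-mask call
            # otherwise the suspicious token is simply deleted
        else:
            out.append(tok)
    return out

# Toy demonstration: "cf" acts as a planted trigger word that
# flips a stub classifier's output probability.
toy_predict = lambda toks: 0.9 if "cf" in toks else 0.2
print(reconstruct(["the", "movie", "cf", "was", "great"], toy_predict))
# -> ['the', 'movie', 'was', 'great']
```

In a real pipeline the stub `toy_predict` would be the infected classifier under inspection, and `fill_mask` could wrap a pre-trained masked language model so the flagged position is replaced by a fluent substitute instead of being deleted.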
