Abstract
Backdoor attacks, which maliciously control a well-trained model's outputs on instances containing specific triggers, have recently been shown to pose serious threats to the safety of reusing deep neural networks (DNNs). In this work, we propose an efficient online defense mechanism based on robustness-aware perturbations. Specifically, by analyzing the backdoor training process, we point out that there exists a large gap in robustness between poisoned and clean samples. Motivated by this observation, we construct a word-based robustness-aware perturbation that distinguishes poisoned samples from clean samples, defending against backdoor attacks on natural language processing (NLP) models. Moreover, we give a theoretical analysis of the feasibility of our robustness-aware perturbation-based defense method. Experimental results on sentiment analysis and toxic detection tasks show that our method achieves better defending performance and much lower computational costs than existing online defense methods. Our code is available at https://github.com/lancopku/RAP.
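The filtering idea described in the abstract can be sketched as follows (a minimal illustration, not the authors' released implementation): insert a fixed perturbation word into the input and compare the model's confidence on the attacker's target class before and after. Clean samples typically lose confidence under the perturbation, while trigger-carrying (poisoned) samples remain robust, so a small probability drop flags the input as poisoned. The perturbation word, threshold, and scoring callable below are assumptions made for exposition.

```python
# Hedged sketch of robustness-aware perturbation (RAP) style filtering.
# `predict_proba` is any callable mapping a text to the probability of the
# attacker's target class (e.g., a fine-tuned sentiment classifier); the
# perturbation word "cf" and the threshold 0.1 are illustrative assumptions,
# not the paper's exact settings.
from typing import Callable

def is_poisoned(text: str,
                predict_proba: Callable[[str], float],
                rap_word: str = "cf",
                threshold: float = 0.1) -> bool:
    """Flag an input as poisoned if inserting the RAP word barely
    changes the target-class probability (i.e., the input is robust)."""
    original = predict_proba(text)
    perturbed = predict_proba(rap_word + " " + text)  # prepend the perturbation word
    drop = original - perturbed
    # Poisoned inputs keep their (backdoored) prediction under the perturbation,
    # so the probability drop stays below the threshold; clean inputs drop more.
    return drop < threshold
```

In the paper the perturbation word's effect is itself optimized so that clean samples exhibit a controlled probability drop, which makes the threshold easier to set; the sketch above only captures the inference-time test.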