Abstract

The vulnerability of deep neural networks (DNNs) to adversarial attacks has attracted attention in many fields, and researchers have sought methods to improve the robustness of DNNs. Most existing methods are empirical defenses that can only cope with known attack environments and specific attack conditions and are likely to be broken by more powerful attacks. In contrast, certified defense methods provide theoretical proofs of robustness bounds and guarantee worst-case adversarial robustness for the target model. Developing certified defense methods is therefore crucial for securing DNNs against adversarial attacks. However, existing studies have paid little attention to certified defenses, and current certified defenses commonly rely on unrealistic prior knowledge and assumptions, which limits their practicality. In this paper, we propose a model-agnostic and attack-agnostic certified defense method that denoises and refactors input samples through a Masker and a Purifier. The Masker is rule-based, the Purifier is trained by self-supervised fine-tuning of an improved BERT-MLM, and the whole implementation is independent of any target model or attack method. We theoretically derive the condition under which the method satisfies certified robustness and obtain its robustness bound through simulation experiments. Meanwhile, experiments on three datasets demonstrate that our proposed method outperforms six state-of-the-art defense methods in balancing prediction accuracy on clean examples against robustness to adversarial attacks.
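
For illustration only, the mask-then-reconstruct idea behind the Masker and Purifier can be sketched as below. This is a minimal, hypothetical example assuming the HuggingFace `transformers` library and an off-the-shelf `bert-base-uncased` masked language model with a toy random masking rule; it is not the authors' implementation, which uses a rule-based Masker and a Purifier obtained by self-supervised fine-tuning of an improved BERT-MLM.

```python
# Illustrative mask-then-purify sketch (NOT the paper's method):
# mask a fraction of input tokens, then let a pretrained BERT MLM
# reconstruct the masked positions before classification.
import random

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()


def purify(sentence: str, mask_rate: float = 0.3, seed: int = 0) -> str:
    """Mask ~mask_rate of the tokens (hypothetical rule), then fill them with the MLM."""
    rng = random.Random(seed)
    inputs = tokenizer(sentence, return_tensors="pt")
    ids = inputs["input_ids"].clone()

    # Mask non-special tokens at random (stand-in for the rule-based Masker).
    special = tokenizer.get_special_tokens_mask(
        ids[0].tolist(), already_has_special_tokens=True
    )
    for pos, is_special in enumerate(special):
        if not is_special and rng.random() < mask_rate:
            ids[0, pos] = tokenizer.mask_token_id

    # Reconstruct masked positions with the MLM's top prediction (the Purifier step).
    with torch.no_grad():
        logits = model(input_ids=ids, attention_mask=inputs["attention_mask"]).logits[0]
    out = ids[0].tolist()
    for pos, tok in enumerate(out):
        if tok == tokenizer.mask_token_id:
            out[pos] = int(logits[pos].argmax())
    return tokenizer.decode(out, skip_special_tokens=True)


if __name__ == "__main__":
    # A perturbed input would be denoised/refactored before being fed to the target model.
    print(purify("the film was an absolutte delightt to watch"))
```

In such a pipeline the purified text, rather than the raw (possibly perturbed) input, is passed to the downstream classifier, so the defense stays independent of the target model and of any particular attack.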
