Abstract
Adversarial attacks in Natural Language Processing greatly undermine the effectiveness and safety of models, posing significant challenges for real-world deployment. Researchers have proposed detection methods that identify and reject adversarial samples while preserving the accuracy of the original model. However, current detection methods rely on a single feature, which limits their robustness and adaptability. To address these limitations, we propose the Multiple Adversarial Features Detector (MAFD), a detection technique that combines a range of adversarial features, such as segmented perplexity, word frequency, and probability distribution, to improve the detection of adversarial examples. Our comprehensive experiments show that MAFD outperforms existing advanced methods in detection accuracy and remains robust and adaptable across various base detectors and attack scenarios. In addition, the design of MAFD allows further adversarial features to be integrated easily, which can further strengthen its detection capabilities.