Abstract

State-of-the-art adversarial attacks in the text domain have shown their power to induce machine learning models to produce abnormal outputs. The samples generated by these attacks have three important attributes: attack ability, transferability, and imperceptibility. However, compared with the other two attributes, the imperceptibility of adversarial examples has not been well investigated. Unlike pixel-level perturbations in images, adversarial perturbations in text are usually traceable, appearing as changes to characters, words, or sentences, so generating imperceptible samples is more difficult for text than for images. How to constrain the adversarial perturbations added to a text is therefore a crucial step toward constructing more natural adversarial texts. Unfortunately, recent studies merely select measurements to constrain the added perturbations, and none of them explain where these measurements are suitable, which one is better, or how they perform in different kinds of adversarial attacks. In this paper, we fill this gap by comparing the performance of these metrics across various attacks. Furthermore, we propose a stricter constraint for word-level attacks to obtain more imperceptible samples, which also helps enhance existing word-level attacks for adversarial training.
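To make the idea of a word-level imperceptibility constraint concrete, the sketch below accepts a candidate word substitution only if it stays within a cosine-similarity threshold of the original word's embedding. This is an illustrative assumption about how such a constraint could be implemented; the toy embedding values, the `is_imperceptible_substitution` helper, and the 0.9 threshold are not metrics or parameters taken from the paper.

```python
import numpy as np

# Toy word embeddings (hypothetical values, for illustration only).
EMBEDDINGS = {
    "good":  np.array([0.80, 0.10, 0.30]),
    "great": np.array([0.75, 0.15, 0.35]),
    "bad":   np.array([-0.70, 0.20, 0.10]),
}


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def is_imperceptible_substitution(original: str, candidate: str,
                                  threshold: float = 0.9) -> bool:
    """Accept a word-level substitution only if the candidate's embedding
    stays close to the original word's embedding (similarity >= threshold)."""
    return cosine_similarity(EMBEDDINGS[original],
                             EMBEDDINGS[candidate]) >= threshold


if __name__ == "__main__":
    print(is_imperceptible_substitution("good", "great"))  # True: near-synonym
    print(is_imperceptible_substitution("good", "bad"))    # False: opposite meaning
```

In practice, such a filter would sit inside the candidate-selection loop of a word-level attack, rejecting substitutions whose semantic distance from the original word exceeds the chosen threshold.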
