Abstract

It is challenging to fool a text classifier based on deep neural networks under the black-box setting, where the target model can only be queried. Among existing black-box attacks, decision-based methods incur a large query cost due to the exponentially large perturbation space and their greedy search strategies. Transfer-based methods, on the other hand, tend to overfit the surrogate model and thus fail when applied to unknown target models. In this paper, we propose a straightforward yet highly effective adversarial attack framework for black-box transformer-based models, thereby exposing vulnerabilities in large language models. Specifically, we leverage a fine-tuned large language model as a white-box surrogate model and optimize a distribution of adversarial text, parameterized by a continuous-valued matrix built upon the surrogate model. To prevent the distribution from overfitting and to improve its adversarial transferability, we incorporate an additional causal language model into our framework as a constraint model. Based on this constraint model, we add language model perplexity and semantic consistency as regularization terms during the distribution training process. To further reduce the number of queries to the target model, i.e., to increase the threat posed by examples drawn from our distribution, we employ a geometric loss strategy that ensures the distribution training process learns the optimal perturbation. Extensive experiments on benchmark datasets demonstrate significant improvements in attack performance and query efficiency under the black-box setting in comparison with well-established approaches. Our approach reduces the accuracy of a BERT model by 80.98% while consuming only 21.86% of the queries required by prior attacks.
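To make the regularized objective described above concrete, the following is a minimal PyTorch sketch, not the paper's implementation: the function name, the weighting coefficients lambda_ppl and lambda_sem, and the tensor shapes are assumptions, and the geometric loss strategy is omitted. It only illustrates how an adversarial term on the surrogate classifier, a perplexity term from the causal constraint model, and a semantic consistency term could be combined into one training loss.

import torch
import torch.nn.functional as F

def regularized_attack_loss(surrogate_logits, true_labels,
                            lm_log_probs, orig_emb, adv_emb,
                            lambda_ppl=0.1, lambda_sem=1.0):
    # Hypothetical composite objective (names and weights are illustrative).
    # Adversarial term: minimizing this maximizes the surrogate model's
    # cross-entropy on the true labels (untargeted attack).
    adv_loss = -F.cross_entropy(surrogate_logits, true_labels)

    # Fluency term: perplexity is the exponentiated average negative
    # token log-likelihood under the causal constraint language model.
    perplexity = torch.exp(-lm_log_probs.mean())

    # Semantic consistency term: penalize drift of the adversarial text's
    # sentence embedding away from the original sentence embedding.
    sem_loss = 1.0 - F.cosine_similarity(orig_emb, adv_emb, dim=-1).mean()

    return adv_loss + lambda_ppl * perplexity + lambda_sem * sem_loss

if __name__ == "__main__":
    # Toy example with random tensors standing in for real model outputs.
    logits = torch.randn(4, 2)                 # surrogate classifier logits
    labels = torch.zeros(4, dtype=torch.long)  # true class labels
    lm_lp = torch.randn(4, 16).clamp(max=0)    # token log-probs from the causal LM
    e_orig, e_adv = torch.randn(4, 768), torch.randn(4, 768)
    print(regularized_attack_loss(logits, labels, lm_lp, e_orig, e_adv))

In practice, such a loss would be backpropagated through the continuous-valued matrix that parameterizes the adversarial text distribution, while the surrogate and constraint models stay frozen.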
