Abstract

Recent studies have shown that natural language processing (NLP) models are vulnerable to adversarial examples, which are maliciously crafted by adding small, human-imperceptible perturbations to benign inputs so that the target model makes false predictions. Compared with character- and sentence-level textual adversarial attacks, word-level attacks can generate higher-quality adversarial examples, especially in a black-box setting. However, existing attack methods usually require a huge number of queries to successfully deceive the target model, which is costly in a real adversarial scenario and makes such attacks difficult to apply in practice. We therefore propose a novel attack method whose main idea is to fully exploit the adversarial examples generated by a local model: part of the attack process is transferred to the local model and completed in advance, thereby reducing the cost of attacking the target model. Extensive experiments on three public benchmarks show that our attack method not only improves the success rate but also reduces the cost, outperforming the baselines by a significant margin.
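The abstract only outlines this idea, so the following is a minimal, hypothetical Python sketch of a local-model-assisted black-box attack of this kind. All names (local_model, target_model, candidate_substitutions, max_target_queries) are illustrative assumptions rather than the paper's actual components; the point is simply that the expensive substitution search runs against the local model offline, and the scarce target-model queries are spent only on candidates that already transfer.

```python
# Hypothetical sketch only: interfaces below are assumptions for illustration,
# not the paper's actual method or API.

def local_model_assisted_attack(text, label, local_model, target_model,
                                candidate_substitutions, max_target_queries=500):
    """Run the word-substitution search on a local surrogate first, then spend
    the limited target-model query budget only on locally successful candidates."""
    words = text.split()

    # Step 1 (offline, no target queries): rank word positions by how much
    # removing each word lowers the local model's confidence in the label.
    def local_importance(i):
        probed = " ".join(words[:i] + words[i + 1:])
        return (local_model.predict_proba(text)[label]
                - local_model.predict_proba(probed)[label])

    order = sorted(range(len(words)), key=local_importance, reverse=True)

    # Step 2 (offline): substitute synonyms at important positions and keep
    # only the perturbed texts that already fool the local model.
    candidates = []
    for i in order:
        for sub in candidate_substitutions.get(words[i], []):
            perturbed = " ".join(words[:i] + [sub] + words[i + 1:])
            if local_model.predict(perturbed) != label:
                candidates.append(perturbed)

    # Step 3 (online): verify the transferred candidates against the black-box
    # target model, stopping at the first success or when the budget runs out.
    for queries, perturbed in enumerate(candidates, start=1):
        if queries > max_target_queries:
            break
        if target_model.predict(perturbed) != label:
            return perturbed, queries  # adversarial example and queries spent
    return None, min(len(candidates), max_target_queries)
```

Under this split, the number of target-model queries is bounded by the number of locally successful candidates rather than by the full size of the substitution search space, which is where the query savings come from.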

Highlights

  • Despite the impressive performance of deep neural networks (DNNs) in various fields, DNNs are known to be vulnerable to adversarial examples that are maliciously crafted by adding interference that is imperceptible to humans [1]

  • We propose a new black-box adversarial attack method whose main idea is to make full use of the adversarial examples generated by a local model and to complete part of the attack process in advance by transferring it to the local model; the method consists of three main parts

  • We propose a new black-box textual adversarial attack strategy that uses the information from adversarial examples generated by a local model


Summary

Introduction

Despite the impressive performance of deep neural networks (DNNs) in various fields, DNNs are known to be vulnerable to adversarial examples that are maliciously crafted by adding interference that is imperceptible to humans [1]. This vulnerability has attracted great interest as well as raised serious concerns about the security of DNNs. As a result, many attack methods have been proposed to further explore the vulnerability of DNNs in the image, speech, and text domains [2,3,4,5,6]. However, perturbations inserted into text are almost impossible to make genuinely imperceptible: even the slightest character-level perturbation might drastically alter the semantics of the original input or destroy its fluency.

