Abstract

The advent of Machine Learning as a Service (MLaaS) and deep learning applications has increased the susceptibility of models to adversarial textual attacks, particularly in black-box settings. Prior work on black-box adversarial textual attacks generally follows a standard strategy: applying character-level, word-level, or sentence-level perturbations and querying the target model to find adversarial examples in the resulting search space. However, existing approaches prioritize query efficiency by reducing the search space, thereby overlooking hard-to-attack textual instances. To address this issue, we propose BFS2Adv, a brute-force algorithm that generates adversarial examples for both easy-to-attack and hard-to-attack textual inputs. Starting from an original text, BFS2Adv employs word-level perturbations and synonym substitution to construct a comprehensive search space in which each node represents a potential adversarial example. The algorithm systematically explores this space through a breadth-first search, combined with queries to the target model, to identify qualified adversarial examples. We implemented a prototype of BFS2Adv and evaluated it against well-known models such as ALBERT and BERT on the SNLI and MR datasets. Our results demonstrate that BFS2Adv outperforms state-of-the-art algorithms and effectively improves the success rate of short-text adversarial attacks. Furthermore, we provide detailed insights into the robustness of BFS2Adv by analyzing these hard-to-attack examples.
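For concreteness, the following minimal Python sketch illustrates the kind of breadth-first search over a synonym-substitution space that the abstract describes. The predict and get_synonyms helpers are hypothetical stand-ins for the black-box target model and the synonym lookup (neither is specified here), and the sketch is an illustration of the idea under those assumptions, not the authors' implementation.

from collections import deque

def bfs2adv(text, true_label, predict, get_synonyms, max_queries=10000):
    # Breadth-first search over word-level synonym substitutions.
    # Each node is a candidate sentence; a child differs from its parent
    # by one additional substitution drawn from the original word's synonyms.
    words = text.split()
    root = tuple(words)
    queue = deque([root])
    visited = {root}
    queries = 0
    while queue:
        node = queue.popleft()
        for i in range(len(node)):
            # Synonyms are taken from the original word at position i,
            # so the search space stays anchored to the input text.
            for syn in get_synonyms(words[i]):
                if queries >= max_queries:
                    return None  # query budget exhausted: a hard-to-attack input
                child = node[:i] + (syn,) + node[i + 1:]
                if child in visited:
                    continue
                visited.add(child)
                candidate = " ".join(child)
                queries += 1  # one query to the black-box model per candidate
                if predict(candidate) != true_label:
                    return candidate  # qualified adversarial example found
                queue.append(child)
    return None

In this sketch, predict(candidate) returns the model's predicted label for a single query, and a success is any candidate whose prediction differs from the original label; enumerating children level by level is what lets the search reach hard-to-attack inputs that greedy, space-reducing methods skip.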
