Abstract

The pervasive brittleness of deep neural networks has attracted significant attention in recent years. A particularly interesting finding is the existence of adversarial examples, imperceptibly perturbed natural inputs that induce erroneous predictions in state-of-the-art neural models. In this paper, we study a different type of adversarial examples specific to code models, called discrete adversarial examples , which are created through program transformations that preserve the semantics of original inputs.In particular, we propose a novel, general method that is highly effective in attacking a broad range of code models. From the defense perspective, our primary contribution is a theoretical foundation for the application of adversarial training — the most successful algorithm for training robust classifiers — to defending code models against discrete adversarial attack. Motivated by the theoretical results, we present a simple realization of adversarial training that substantially improves the robustness of code models against adversarial attacks in practice. We extensively evaluate both our attack and defense methods. Results show that our discrete attack is significantly more effective than state-of-the-art whether or not defense mechanisms are in place to aid models in resisting attacks. In addition, our realization of adversarial training improves the robustness of all evaluated models by the widest margin against state-of-the-art adversarial attacks as well as our own.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call