White-Box Attacks on Hate-speech BERT Classifiers in German with Explicit and Implicit Character Level Defense

Shahrukh Khan, Navdeeppal Singh, Mahnoor Shahid

doi:10.54646/bijiiac.004

Abstract

Attention-based Transformer models have achieved state-of-the-art results in natural language processing (NLP). However, recent work shows that the underlying attention mechanism can be exploited by adversaries to craft malicious inputs designed to induce spurious outputs, thereby harming model performance and trustworthiness. Unlike in the vision domain, the literature examining neural networks under adversarial conditions in the NLP domain is limited, and most of it focuses on the English language. In this paper, we first analyze the adversarial robustness of Bidirectional Encoder Representations from Transformers (BERT) models on German datasets. Second, we introduce two novel NLP attacks: a character-level attack and a word-level attack, both of which use attention scores to decide where to inject character-level and word-level noise, respectively. Finally, we present two defense strategies against these attacks. The first, an implicit character-level defense, is a variant of adversarial training that trains a new classifier capable of abstaining from, or rejecting, certain (ideally adversarial) inputs. The second, an explicit character-level defense, learns a latent representation of the complete training data vocabulary and then maps all tokens of an input example to the same latent space, enabling the replacement of all out-of-vocabulary tokens with the most similar in-vocabulary tokens based on the cosine similarity metric.
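
The explicit defense's replacement step can be illustrated with a minimal sketch. It is not the implementation used in this work: the learned latent representation is swapped for plain character-trigram count vectors (an assumption made only to keep the example self-contained), and the names `char_ngrams`, `cosine`, and `repair_tokens` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(token, n=3):
    """Bag of character n-grams for a token, with boundary markers.

    Assumption: this stands in for the learned latent representation.
    """
    padded = f"<{token.lower()}>"
    return Counter(padded[i:i + n] for i in range(max(1, len(padded) - n + 1)))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def repair_tokens(tokens, vocabulary):
    """Replace each out-of-vocabulary token with its nearest in-vocabulary token."""
    # Embed the (lowercased) vocabulary once, in the same space as the inputs.
    vocab_vectors = {word.lower(): char_ngrams(word) for word in vocabulary}
    if not vocab_vectors:
        return list(tokens)
    repaired = []
    for token in tokens:
        if token.lower() in vocab_vectors:
            repaired.append(token)  # in-vocabulary tokens pass through unchanged
            continue
        vector = char_ngrams(token)
        best = max(vocab_vectors, key=lambda w: cosine(vector, vocab_vectors[w]))
        repaired.append(best)
    return repaired


vocabulary = ["du", "bist", "dumm", "nett"]
print(repair_tokens(["du", "b1st", "dum+m"], vocabulary))
```

In this toy run the perturbed tokens `b1st` and `dum+m` share the most trigrams with `bist` and `dumm`, so they are mapped back to those vocabulary entries before the text would reach the classifier.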

Full Text