Abstract

Neural relation extraction (NRE) models are the backbone of various machine learning tasks, including knowledge-base enrichment, information extraction, and document summarization. Despite the vast popularity of these models, their vulnerabilities remain largely unexplored; this is of high concern given their growing use in security-sensitive applications such as question answering and machine translation, including in sustainability-related settings. In this study, we demonstrate that NRE models are inherently vulnerable to adversarially crafted text that contains imperceptible modifications of the original input but can mislead the target NRE model. Specifically, we propose a novel, sustainable term frequency-inverse document frequency (TFIDF)-based black-box adversarial attack to evaluate the robustness of state-of-the-art CNN-, GCN-, LSTM-, and BERT-based models on two benchmark RE datasets. Compared with white-box adversarial attacks, black-box attacks impose further constraints on the query budget; thus, efficient black-box attacks remain an open problem. By applying TFIDF to the correctly classified sentences of each class label in the test set, the proposed query-efficient method reduces the number of queries to the target model for identifying important text items by up to 70%. Based on these items, we design both character- and word-level perturbations to generate adversarial examples. The proposed attack reduces the performance of six representative models from an average F1 score of 80% to below 20%. Human evaluators judged the generated adversarial examples to be semantically similar to the originals. Moreover, we discuss defense strategies that mitigate such attacks and potential countermeasures that could be deployed to improve the sustainability of the proposed scheme.
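To make the query-efficient word-selection step more concrete, the following is a minimal sketch, not the authors' implementation: it ranks the words of one class label's correctly classified sentences by TFIDF weight using scikit-learn, then applies illustrative character- and word-level perturbations to the top-ranked word. The function names, the synonym table, and the specific perturbation rules are assumptions for illustration only.

```python
# Minimal sketch (assumed, not the authors' code) of TFIDF-based word ranking
# followed by simple character- and word-level perturbations.
from sklearn.feature_extraction.text import TfidfVectorizer

def rank_words_by_tfidf(sentences_of_class):
    """Rank words in the correctly classified sentences of one relation label
    by aggregate TFIDF weight, so only top-ranked words are queried."""
    vec = TfidfVectorizer(lowercase=True)
    tfidf = vec.fit_transform(sentences_of_class)
    scores = tfidf.sum(axis=0).A1               # total weight per vocabulary word
    vocab = vec.get_feature_names_out()
    return sorted(zip(vocab, scores), key=lambda p: p[1], reverse=True)

def char_perturb(word):
    """Character-level perturbation: swap two middle characters (illustrative)."""
    if len(word) < 4:
        return word
    chars = list(word)
    mid = len(chars) // 2
    chars[mid - 1], chars[mid] = chars[mid], chars[mid - 1]
    return "".join(chars)

def word_perturb(word, synonyms):
    """Word-level perturbation: replace the word with a synonym if available."""
    return synonyms.get(word, word)

# Hypothetical usage on a tiny example
sentences = ["the company was founded by the entrepreneur",
             "the startup was founded in berlin"]
top_word = rank_words_by_tfidf(sentences)[0][0]
print(char_perturb(top_word), word_perturb(top_word, {"founded": "established"}))
```

Because the ranking is computed once per class label from already-classified sentences, candidate words can be reused across sentences of that class, which is one plausible way the reported reduction in target-model queries could be achieved.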

Highlights

  • The proposed adversarial attack successfully reduced the accuracy of the targeted models to under 20%, while perturbing no more than 20% of the words in a sentence

  • We propose a novel query-efficient term frequency-inverse document frequency (TFIDF)-based black-box adversarial attack and generate semantically similar and plausible adversarial examples for the neural relation extraction (NRE) task

  • The transferability of adversarial attacks on NRE models had not previously been considered; we evaluate this property by generating adversarial texts on both relation extraction (RE) datasets used in this study and the corresponding NRE models


Introduction

Relation extraction (RE) is the task of classifying the relation between two entities in text. It is important and useful in several natural language processing (NLP) applications [1], such as question answering [2], information extraction [3], knowledge-base enrichment, machine translation, and document summarization. Deep neural networks (DNNs) have been applied successfully to a variety of NLP tasks [4,5]. Their outstanding performance on complex data has attracted considerable attention from researchers, and DNNs are now used widely in domains such as NLP, computer vision, and speech recognition.
