Abstract

Sentiment classification is broadly applied in real-world tasks such as product recommendation and opinion-oriented analysis. Unfortunately, the widely deployed sentiment classification systems based on deep neural networks (DNNs) are susceptible to adversarial attacks that introduce imperceptible perturbations into legitimate texts (producing so-called adversarial texts). Adversarial texts can cause erroneous outputs even without access to the target model, raising security concerns for systems deployed in safety-critical applications. However, studies on defending against adversarial texts are still at an early stage and not ready to tackle emerging threats, especially unknown attacks. Investigating the subtle differences between adversarial and legitimate texts and enhancing the robustness of target models are the two mainstream defense ideas; both, however, suffer from a generalization issue when dealing with unknown adversarial attacks. In this paper, we propose a general method, called TextFirewall, for defending against adversarial texts crafted by various adversarial attacks, which shows potential for identifying newly developed attacks in the future. Given a piece of text, TextFirewall identifies adversarial text by examining the inconsistency between the target model's output and an impact value computed from the important words in the text. TextFirewall can be deployed as a third-party tool without modifying the target model and is agnostic to the specific type of adversarial text. Experimental results demonstrate that TextFirewall effectively identifies adversarial texts generated by three state-of-the-art (SOTA) attacks and outperforms previous defense techniques. Specifically, TextFirewall achieves an average accuracy of 90.7% on IMDB and 96.9% on Yelp in defending against the three SOTA attacks.
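The detection idea described above can be illustrated with a minimal sketch: aggregate the sentiment impact of a text's important words, and flag the text when that aggregate disagrees with the target model's label. The lexicon, scoring function, and threshold below are illustrative assumptions for exposition, not the paper's actual components.

```python
# Toy inconsistency check in the spirit of TextFirewall.
# SENTIMENT_LEXICON is a hypothetical stand-in for the paper's
# important-word impact values.
SENTIMENT_LEXICON = {"great": 1.0, "excellent": 0.8, "awful": -1.0, "boring": -0.6}


def impact_value(text):
    """Sum the lexicon scores of the opinion-bearing words in the text."""
    return sum(SENTIMENT_LEXICON.get(w, 0.0) for w in text.lower().split())


def is_adversarial(text, model_label, threshold=0.5):
    """Flag the text when the model's label (1 = positive, 0 = negative)
    disagrees with the aggregated impact of its important words and the
    impact is strong enough to trust."""
    score = impact_value(text)
    lexicon_polarity = 1 if score > 0 else 0
    return abs(score) >= threshold and lexicon_polarity != model_label


# A clearly positive review that the (attacked) model labels negative
# is flagged as a likely adversarial text:
print(is_adversarial("a great and excellent film", model_label=0))  # True
```

Note that this sketch only captures the consistency-checking intuition; the actual method computes impact values from the target model's behavior rather than a fixed lexicon.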

Highlights

  • Experimental results show that TextFirewall effectively distinguishes adversarial texts generated by various adversarial attacks and outperforms existing defense methods in accuracy and generality

  • Our proposed TextFirewall leverages knowledge of legitimate texts only, without requiring access to adversarial texts generated by known or unknown adversarial attacks

Introduction

With the rapid development of online social networks, users have produced massive numbers of comments that provide valuable information for mining preferences, attitudes, and opinions on products and services using sentiment classification techniques [1], [2]. Sentiment classification models based on deep neural networks (DNNs) are continuously being developed thanks to the significant progress of DNNs in the natural language processing (NLP) field. However, studies [3], [4] have shown that DNN-based models are vulnerable to adversarial examples, which are crafted with perturbations imperceptible to human eyes. Adversarial examples have been demonstrated in the image [5], [6], text [7], [8], audio [9], and even malware [10], [11] domains. They raise security concerns for users who deploy DNN-based systems in safety-critical applications, such as self-driving and text analysis for commercial purposes.
