Abstract

A backdoor attack causes a deep neural network to misrecognize data that contain a specific trigger by additionally training the model on malicious training data that include that trigger. Under this attack, the network correctly recognizes normal data without the trigger, but misrecognizes data containing the trigger as a target class chosen by the attacker. In this paper, I propose a defense method against backdoor attacks that uses a detection model. The method detects backdoor samples by comparing the output of the target model with that of a detection model trained on a portion of the original, secure training dataset. This defense requires neither trigger reverse-engineering nor access to the entire training dataset. The experiments used the TensorFlow machine-learning library with the MNIST and Fashion-MNIST datasets. The results show that when 200 samples of partial training data are used for the detection model, the proposed method achieves detection rates of 70.1% and 74.4% for backdoor samples in MNIST and Fashion-MNIST, respectively.
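As a rough illustration of the comparison step described above, the sketch below trains a small detection model on a partial clean subset of the training data and flags inputs on which it disagrees with the (possibly backdoored) target model. The architecture, hyperparameters, and helper names (build_detection_model, flag_backdoor_samples) are assumptions made for illustration, not the paper's exact implementation.

```python
# Illustrative sketch only; the paper's actual model architecture and
# training settings are not reproduced here.
import numpy as np
import tensorflow as tf

def build_detection_model(x_clean, y_clean, num_classes=10):
    """Train a small detection model on a partial, secure subset
    (e.g. 200 samples) of the original training data."""
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=x_clean.shape[1:]),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_clean, y_clean, epochs=20, verbose=0)
    return model

def flag_backdoor_samples(target_model, detection_model, x_test):
    """Flag inputs on which the (possibly backdoored) target model and the
    clean detection model disagree; these are treated as suspected backdoor samples."""
    target_pred = np.argmax(target_model.predict(x_test, verbose=0), axis=1)
    detect_pred = np.argmax(detection_model.predict(x_test, verbose=0), axis=1)
    return target_pred != detect_pred  # boolean mask of suspected backdoor inputs
```

The design intuition is that a backdoor only affects the model that was trained on the poisoned data, so a disagreement between the two models on a given input is taken as evidence of a trigger.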

Highlights

  • Deep neural networks [1] perform well in machine learning applications, such as image recognition [2] and speech recognition [3].

  • The attack success rate is the rate at which the class recognized by the target model matches the target class intended by the attacker for a backdoor sample.

  • The detection rate is the rate at which a backdoor sample is correctly recognized as its original class by the detection model while being incorrectly recognized as the target class by the target model (see the illustrative computation after this list).
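One way to read the two rates above is the following computation over a set of backdoor samples. The array names (target_pred, detect_pred, original_labels) and this exact formulation are assumptions for illustration, not taken from the paper.

```python
# Hypothetical computation of the two rates defined in the highlights,
# given per-sample predictions for a set of backdoor samples.
import numpy as np

def attack_success_rate(target_pred, target_class):
    """Fraction of backdoor samples that the target model assigns to the attacker's target class."""
    return np.mean(target_pred == target_class)

def detection_rate(target_pred, detect_pred, original_labels, target_class):
    """Fraction of backdoor samples that the detection model recognizes as their
    original class while the target model misrecognizes them as the target class."""
    detected = (detect_pred == original_labels) & (target_pred == target_class)
    return np.mean(detected)
```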



Introduction

Deep neural networks [1] perform well in machine learning applications, such as image recognition [2] and speech recognition [3]. Barreno et al. [4] classified security threats to deep neural networks into exploratory and causative attacks. An exploratory attack causes a model to misrecognize data by manipulating the test data, without access to the model's training process. A causative attack causes misrecognition by interfering with the model's training data. Typical causative attacks are the poisoning attack [6] and the backdoor attack [7]. An exploratory attack requires manipulation of real-time test data, whereas a causative attack targets the model during the training process.
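As a concrete, hypothetical illustration of a causative attack, backdoor training data can be crafted by stamping a small trigger patch onto clean images and relabeling them with the attacker's target class. The 3x3 white-square trigger, the poisoning ratio, and the assumption of images normalized to [0, 1] below are illustrative choices, not the paper's exact setup.

```python
# Illustrative sketch of crafting backdoor (causative) training data:
# a small trigger patch is stamped onto clean images and their labels
# are switched to the attacker's target class.
import numpy as np

def make_backdoor_samples(x, y, target_class, ratio=0.1, patch_size=3):
    """Return trigger-stamped copies of a random subset of (x, y),
    relabeled to target_class. Assumes x has shape (N, H, W) in [0, 1]."""
    n_poison = int(len(x) * ratio)
    idx = np.random.choice(len(x), n_poison, replace=False)
    x_bd = x[idx].copy()
    x_bd[:, -patch_size:, -patch_size:] = 1.0   # white square in the bottom-right corner
    y_bd = np.full(n_poison, target_class)
    return x_bd, y_bd
```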

