Abstract

A backdoor attack causes misrecognition in a deep neural network by training it on additional data that contain a specific trigger. The network correctly recognizes normal samples (which lack the trigger) as their proper classes but misrecognizes backdoor samples (which contain the trigger) as the attacker's target classes. In this paper, I propose a method of defense against backdoor attacks that uses a de-trigger autoencoder. In the proposed scheme, the trigger in a backdoor sample is removed by the de-trigger autoencoder, and the backdoor sample is detected from the resulting change in the classification result. Experiments were conducted on the MNIST, Fashion-MNIST, and CIFAR-10 datasets using the TensorFlow machine learning library. For MNIST, Fashion-MNIST, and CIFAR-10, respectively, the proposed method detected 91.5%, 82.3%, and 90.9% of the backdoor samples and achieved 96.1%, 89.6%, and 91.2% accuracy on legitimate samples.
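
The detection idea stated in the abstract can be sketched in a few lines of TensorFlow, the library used for the experiments. The sketch below is illustrative only: `build_detrigger_autoencoder`, `detect_backdoor`, and `target_model` are placeholder names, and the autoencoder architecture is an assumed minimal convolutional design, not necessarily the one used in the paper.

```python
# Minimal sketch of the detection scheme: reconstruct each input with a
# "de-trigger" autoencoder and flag samples whose predicted class changes.
import numpy as np
import tensorflow as tf

def build_detrigger_autoencoder(input_shape=(28, 28, 1)):
    """A small convolutional autoencoder (assumed architecture, MNIST-sized input)."""
    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
    x = tf.keras.layers.MaxPooling2D(2, padding="same")(x)
    x = tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same")(x)
    x = tf.keras.layers.UpSampling2D(2)(x)
    outputs = tf.keras.layers.Conv2D(input_shape[-1], 3, activation="sigmoid",
                                     padding="same")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

def detect_backdoor(target_model, detrigger_ae, samples):
    """Return a boolean mask: True where the class changes after trigger removal."""
    pred_before = np.argmax(target_model.predict(samples, verbose=0), axis=1)
    reconstructed = detrigger_ae.predict(samples, verbose=0)
    pred_after = np.argmax(target_model.predict(reconstructed, verbose=0), axis=1)
    return pred_before != pred_after  # True = suspected backdoor sample
```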

Highlights

  • Deep neural networks [1] provide good performance in the fields of image recognition [2], speech recognition [3], pattern analysis [4], and intrusion detection [5], which are typical machine learning tasks

  • The detection rate is computed from the change in the classification result when backdoor samples are passed through the de-trigger autoencoder (see the sketch after this list)

  • The backdoor samples that were passed through the de-trigger autoencoder were correctly recognized by the target model because the trigger had been removed
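
Under the same assumptions as the earlier sketch, the reported detection rate and legitimate-sample accuracy could be tallied from the boolean flags returned by `detect_backdoor`; the exact metric definitions used in the paper may differ.

```python
# Assumed metric definitions:
#   detection rate      = fraction of backdoor samples whose class changes
#   legitimate accuracy = fraction of legitimate samples whose class does not change
import numpy as np

def detection_rate(flags_on_backdoor_samples: np.ndarray) -> float:
    return float(np.mean(flags_on_backdoor_samples))

def legitimate_accuracy(flags_on_legitimate_samples: np.ndarray) -> float:
    return float(np.mean(~flags_on_legitimate_samples))
```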

Introduction

Deep neural networks [1] provide good performance in the fields of image recognition [2], speech recognition [3], pattern analysis [4], and intrusion detection [5], which are typical machine learning tasks. Barreno et al. [6] categorized the security risks of machine learning into those from exploratory attacks and those from causative attacks. An exploratory attack [7] induces misrecognition in a model that has already been trained by manipulating the test data. A causative attack [12] degrades the accuracy of a model by adding malicious, artificially crafted data during the model's training process. The method proposed in this paper is a defense against the backdoor type of causative attack.
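
To make the threat model concrete, the following is a minimal sketch of how a backdoor-style causative attack might poison a training set before model training. The trigger pattern, its location, the poisoning rate, and the name `poison_dataset` are illustrative assumptions rather than the specific attack evaluated in the paper.

```python
import numpy as np

def poison_dataset(x_train, y_train, target_class=0, poison_rate=0.05, seed=0):
    """Stamp a small white trigger onto a fraction of the training images and
    relabel them to the attacker's target class (assumed attack parameters).

    Assumes images are arrays shaped (N, H, W) or (N, H, W, C) with pixel
    values scaled to [0, 1].
    """
    rng = np.random.default_rng(seed)
    x_poisoned, y_poisoned = x_train.copy(), y_train.copy()
    n_poison = int(len(x_train) * poison_rate)
    idx = rng.choice(len(x_train), size=n_poison, replace=False)
    # 3x3 trigger patch near the bottom-right corner of each chosen image.
    x_poisoned[idx, -4:-1, -4:-1] = 1.0
    y_poisoned[idx] = target_class
    return x_poisoned, y_poisoned
```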
