Abstract

The development of deep learning models for speech emotion recognition has become a popular area of research. Adversarially generated data can cause deep models to make false predictions, and to ensure model robustness, defense methods against such attacks must be addressed. With this in mind, in this study we aim to train deep models to defend against non-targeted white-box adversarial attacks. Adversarial data is first generated from the real data using the fast gradient sign method. Adversarial training is then employed as a method of protecting against adversarial attacks in the field of speech emotion recognition. We train deep convolutional models with both real and adversarial data, and compare the performance of two adversarial training procedures: vanilla adversarial training, and similarity-based adversarial training. In our experiments, through adversarial data augmentation, both adversarial training procedures improve performance when validated on the real data. Additionally, similarity-based adversarial training learns a more robust model when working with adversarial data. Finally, the considered VGG-16 model performs best across all models, for both real and generated data.
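For concreteness, the sketch below illustrates the two building blocks named above: generating non-targeted adversarial examples with the fast gradient sign method (FGSM), and a vanilla adversarial training step that mixes real and adversarial data. This is a minimal sketch in PyTorch, not the authors' implementation; the function names, the perturbation budget epsilon, the loss-mixing weight alpha, and the use of cross-entropy are illustrative assumptions, and the similarity-based variant compared in the study is not shown.

    import torch
    import torch.nn.functional as F

    def fgsm_attack(model, x, y, epsilon=0.01):
        # Non-targeted FGSM: perturb the input in the direction that
        # increases the loss, x_adv = x + epsilon * sign(grad_x L(x, y)).
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        return (x + epsilon * x.grad.sign()).detach()

    def vanilla_adversarial_step(model, optimizer, x, y,
                                 epsilon=0.01, alpha=0.5):
        # One vanilla adversarial training step: the batch loss is a
        # weighted mix of the losses on clean and FGSM-perturbed inputs.
        x_adv = fgsm_attack(model, x, y, epsilon)
        optimizer.zero_grad()  # clear gradients left over from the attack
        loss = alpha * F.cross_entropy(model(x), y) \
             + (1 - alpha) * F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
        return loss.item()

With epsilon = 0 the step reduces to standard training on the real data alone, which is one way to see why this procedure acts as a form of adversarial data augmentation.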

Highlights

  • Emotion recognition has become a popular research topic in recent years, as improving interaction between humans and machines is an essential part of Artificial Intelligence (AI) research

  • When inferring on the fake data (i.e., adversarial attacks), both of the proposed training approaches perform well on the adversarial data across the three Convolutional Neural Network (CNN) architectures, although their performance is slightly worse than on the real data

  • We propose a system for training a deep speech emotion recognition Convolutional Neural Network (CNN) model that is robust against adversarial attacks


Introduction

Emotion recognition has become a popular research topic in recent years, as improving interaction between humans and machines is an essential part of Artificial Intelligence (AI) research. Systems with integrated speech-based emotion recognition have found many real-life applications, including in human-robot interaction (HRI) [1], educational settings [2], and as a diagnosis tool for conditions such as depression [3]. Emotion recognition can be performed more robustly through multimodal approaches [4, 5]; nevertheless, speech alone has been shown to be a valuable modality for this task, due to the array of information transmitted via the speech signal [6]. Deep learning-based methods have been successful for speech-based emotion recognition [7], and improving the robustness of deep learning models for real-life implementation is an important factor in AI research [8]. One aspect of concern for the robust development of real-world models is that they are vulnerable to adversarial attacks.
