Abstract

Neural networks have attracted increasing research attention, but their learning process is often opaque to humans, which has motivated feature visualization techniques. Activation Maximization (AM) is one such technique, originally designed for image data: the input data is optimized to find the input that most strongly activates a selected neuron. In this paper, the output of an emotion recognizer is selected as that neuron, and the latent code of a generator (from a Generative Adversarial Network) is optimized instead of the raw input data. The aim of this study is to apply AM to different representations of audio data (waveform-based and mel-spectrogram-based) and different model structures (CNN, WaveNet, LSTM), and to determine the most suitable conditions for AM on audio data. Additionally, we visualize the features essential to each class in speech emotion classification, using two datasets: the Toronto emotional speech set (TESS) and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). The mel-spectrogram-based models were found to be superior to the others, showing the distinctive features of the selected emotions; in particular, the CNN mel-spectrogram-based model performed best in both qualitative and quantitative (FID score) evaluations. Moreover, as demonstrated in this study, AM can also be employed as an output enhancer for generative models.

Highlights

  • Neural networks are predominant in various tasks such as object detection, speech recognition, and emotion detection

  • In class-based Activation Maximization, the input data is gradually altered to increase the output of a trained classifier for a chosen class, revealing what makes an input belong to that class. A related approach instead optimizes the latent noise of a generative model, such as a Generative Adversarial Network (GAN) [4], which serves as a prior

  • Since we cannot embed audio in this paper, we have uploaded the results with soundtracks on GitHub



Introduction

Neural networks are predominant in various tasks such as object detection, speech recognition, and emotion detection, yet their internal processing is, in general, not understandable to humans. Two lines of work are relevant to our experimentation. In the first, class-based Activation Maximization, the input data is gradually altered to increase the output of a trained classifier, in order to observe what makes an input belong to a certain class. In the second, the noise (latent code) fed to the generator of a generative model, such as a Generative Adversarial Network (GAN) [4], is optimized, with the generator employed as a prior.
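The class-based procedure described above can be sketched with a toy example. The snippet below is a minimal illustration, not the paper's implementation: it assumes a fixed linear "classifier" `W` in place of a trained network and performs gradient ascent on the input to maximize a chosen class score (in the paper, the same ascent is applied to a GAN's latent code rather than to raw audio).

```python
import numpy as np

# Toy class-based Activation Maximization (hypothetical setup):
# W stands in for a trained classifier; we ascend the gradient of the
# target class score with respect to the input x.
rng = np.random.default_rng(0)
n_features, n_classes = 16, 4
W = rng.normal(size=(n_classes, n_features))  # stand-in "classifier" weights

def activation_maximization(target, steps=200, lr=0.1):
    x = rng.normal(size=n_features) * 0.01  # start from small random noise
    for _ in range(steps):
        grad = W[target]                     # d(score)/dx for a linear model
        x += lr * grad                       # gradient ascent on the class score
        x /= max(np.linalg.norm(x), 1.0)     # keep the input bounded
    return x

x_opt = activation_maximization(target=2)
print(np.argmax(W @ x_opt))  # the optimized input should excite class 2
```

For a real network the gradient would come from automatic differentiation, and the optimized variable would be the generator's latent vector, so the result stays on the data manifold instead of drifting into adversarial noise.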

