Abstract

This paper investigates how stress is expressed in speech, not only through its acoustic characteristics but also through its semantic content. The investigation is motivated by the need to recognize different intensities of stress in surveillance applications. Our goal is to recognize stress in human-human interactions at a service desk by analyzing the behavioral patterns of the interlocutors as they interact with each other. More specifically, this paper focuses on stress recognition from speech using both its non-verbal (acoustic) and verbal (semantic) components. For this purpose, we combine image-based deep spectrum features with text-based features using neural networks. For the acoustic part, we use a pre-trained Convolutional Neural Network (CNN) to extract descriptors from audio spectrograms. These descriptors, referred to as deep spectrum features, are the activations of the fully connected layers of VGG16, an image classification CNN. For the semantic part, we adopt text-based features covering linguistic content, word affect, and indicators of spontaneous speech. To obtain the final feature set, we apply Multilayer Perceptrons (MLPs) to learn representations of both the deep spectrum and textual features before fusing them with a neural network. Our fusion method achieves weighted and unweighted average accuracies of 83.57% and 82.46%, respectively.
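As a rough illustration of the pipeline described above, the sketch below (PyTorch, not the authors' implementation) shows deep spectrum extraction from the fully connected layers of a pre-trained VGG16, per-modality MLPs, and a small fusion network. The hidden sizes, the text feature dimension, and the number of stress classes are placeholder assumptions.

```python
# Minimal sketch, assuming PyTorch/torchvision; feature sizes and layer
# choices are illustrative, not taken from the paper.
import torch
import torch.nn as nn
from torchvision import models


class DeepSpectrumExtractor(nn.Module):
    """Returns activations of VGG16's first fully connected layer (4096-d)."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.features = vgg.features
        self.avgpool = vgg.avgpool
        self.fc = vgg.classifier[0]           # first FC layer of VGG16

    def forward(self, spectrogram_images):    # (B, 3, 224, 224) spectrogram plots
        x = self.features(spectrogram_images)
        x = torch.flatten(self.avgpool(x), 1)
        return self.fc(x)                      # (B, 4096) deep spectrum features


class ModalityMLP(nn.Module):
    """Learns a compact representation for one modality before fusion."""
    def __init__(self, in_dim, hidden=256, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)


class FusionClassifier(nn.Module):
    """Concatenates acoustic and textual representations to predict stress intensity."""
    def __init__(self, text_dim=100, n_classes=3):  # assumed dimensions
        super().__init__()
        self.acoustic_mlp = ModalityMLP(in_dim=4096)
        self.text_mlp = ModalityMLP(in_dim=text_dim)
        self.head = nn.Sequential(
            nn.Linear(64 + 64, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, deep_spectrum_feats, text_feats):
        a = self.acoustic_mlp(deep_spectrum_feats)
        t = self.text_mlp(text_feats)
        return self.head(torch.cat([a, t], dim=1))
```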
