Abstract

Speech emotion recognition (SER) is an active research field in digital signal processing and plays a crucial role in numerous applications of human–computer interaction (HCI). Current state-of-the-art baseline systems suffer from relatively low accuracy and high computational cost, which must be improved to make them suitable for real-time industrial uses such as detecting content from speech data. The main reasons for the low recognition rate and high computational cost are the scarcity of datasets, model configuration, and pattern recognition, which remain the most challenging aspects of building a robust SER system. In this study, we address these problems and propose a simple, lightweight deep learning-based self-attention module (SAM) for an SER system. The intermediate feature map is fed to SAM, which efficiently produces attention maps along the channel and spatial axes with negligible overhead. We use a multi-layer perceptron (MLP) in the channel attention to extract global cues, and a special dilated convolutional neural network (CNN) in the spatial attention to extract spatial information from the input tensor. We then merge the spatial and channel attention maps to produce combined attention weights as a self-attention module. SAM is placed between the convolutional and fully connected layers and trained in an end-to-end manner. An ablation study and comprehensive experiments are conducted on the IEMOCAP, RAVDESS, and EMO-DB speech emotion datasets. The proposed SER system shows consistent improvements across all experiments on all datasets, achieving 78.01%, 80.00%, and 93.00% average recall, respectively.
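The abstract describes combining channel attention (global pooling followed by an MLP) with spatial attention (a small convolution over the channel-pooled map) into one attention module. The following is a minimal NumPy sketch of that general pattern, not the authors' implementation: the weight shapes, the sigmoid gating, and the plain 3×3 kernel standing in for the paper's dilated CNN are all illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    # x: (C, H, W). Global average pooling over the spatial axes -> (C,)
    pooled = x.mean(axis=(1, 2))
    # Two-layer bottleneck MLP producing per-channel weights in (0, 1)
    hidden = np.maximum(0.0, w1 @ pooled)  # ReLU
    return sigmoid(w2 @ hidden)            # shape (C,)

def spatial_attention(x, kernel):
    # Average over channels -> (H, W), then a single 3x3 convolution
    # (a plain cross-correlation here; the paper uses a dilated CNN).
    m = x.mean(axis=0)
    H, W = m.shape
    pad = np.pad(m, 1)
    out = np.zeros_like(m)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(pad[i:i + 3, j:j + 3] * kernel)
    return sigmoid(out)                    # shape (H, W)

def sam(x, w1, w2, kernel):
    # Merge the two attention maps by broadcasting and rescale the input.
    ca = channel_attention(x, w1, w2)[:, None, None]  # (C, 1, 1)
    sa = spatial_attention(x, kernel)[None, :, :]     # (1, H, W)
    return x * ca * sa                                # refined feature map

# Toy usage with random weights (shapes are illustrative).
rng = np.random.default_rng(0)
C, H, W = 4, 5, 5
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((2, C))   # bottleneck: C -> 2
w2 = rng.standard_normal((C, 2))   # expand: 2 -> C
kernel = rng.standard_normal((3, 3))
y = sam(x, w1, w2, kernel)
print(y.shape)  # same shape as the input feature map
```

Because both attention maps pass through a sigmoid, every output value is the input scaled by factors in (0, 1), so the module reweights features without changing the tensor shape, which is what allows it to be dropped between the convolutional and fully connected layers.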
