Abstract

Speech emotion recognition (SER) plays a crucial role in Human–computer interaction (HCI) applications. However, it faces two challenges: the limited effectiveness of deep learning models and data scarcity. As a result, deep learning models used for SER suffer seriously from overfitting. In this study, a novel SER model called the Max-avg-pooling capsule network (MA-CapsNet) is proposed; it is an improved capsule network customized for SER. We also adopt Data augmentation (DA) techniques to tackle the data scarcity issue. The proposed MA-CapsNet consists of five sequential modules: a conv-max-pooling module, a conv-avg-pooling module, a convolution module, a primary capsule module, and a digital capsule module. Furthermore, a new evaluation metric called the Expected accuracy index (EAI) is presented to evaluate model performance effectively. The proposed approach shows clear advantages over its peer models under different data partition methods, especially on augmented datasets. Experimental results also show that the proposed model offers better interpretability than the peer methods, owing to its less complicated learning topology, smaller parameter set, and fewer input features.
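The primary and digital capsule modules of a capsule network conventionally apply the squash nonlinearity from the original CapsNet formulation, which rescales a capsule's output vector so its length lies in [0, 1) while preserving its direction; the length can then be read as an existence probability. A minimal NumPy sketch of that nonlinearity (the function name, axis convention, and example shapes are illustrative, not taken from the paper):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash nonlinearity: v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||).

    Shrinks short vectors toward zero and pushes long vectors toward
    (but never onto) unit length, keeping the direction unchanged.
    """
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / np.sqrt(sq_norm + eps)

# Illustrative example: a batch of 2 capsules with 8-dimensional poses.
caps = np.random.randn(2, 8) * 3.0
out = squash(caps)
print(np.linalg.norm(out, axis=-1))  # every length is strictly below 1
```

For a vector of length 5, for instance, the squashed length is 25/26 ≈ 0.96, so confidently activated capsules approach, but never reach, unit length.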
