Abstract
Speech emotion recognition (SER) plays a crucial role in Human–computer interaction (HCI) applications. However, it faces two challenges: the limited effectiveness of deep learning models and data scarcity, so deep learning models used for SER tend to suffer severely from overfitting. In this study, a novel SER model called the Max-avg-pooling capsule network (MA-CapsNet) is proposed; it is an improved capsule network customized for SER. We also adopt Data augmentation (DA) techniques to tackle the data scarcity issue. The proposed MA-CapsNet model consists of five sequential modules: a conv-max-pooling module, a conv-avg-pooling module, a convolution module, a primary capsule module, and a digital capsule module. Furthermore, a new evaluation metric called the Expected accuracy index (EAI) is presented to evaluate model performance effectively. The proposed approach demonstrates strong advantages over its peer models under different data partition methods, especially on augmented datasets. Experimental results also show that the proposed model is more interpretable than the peer methods, owing to its less complicated learning topology, relatively smaller parameter set, and fewer input features.
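The abstract names the model's two distinguishing front-end modules after max pooling and average pooling. As a point of reference only (the abstract does not give the actual layer configuration, and the function name `pool2d` is our own), the following minimal NumPy sketch shows how the two pooling operations differ on the same feature map:

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping 2D pooling over size x size windows.

    Illustrative only: assumes `size` evenly divides both dimensions,
    as the abstract does not specify kernel sizes or strides.
    """
    h, w = x.shape
    blocks = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))   # keep the strongest activation per window
    return blocks.mean(axis=(1, 3))      # keep the average activation per window

# A toy 4x4 feature map (e.g., one channel of a convolved spectrogram)
feat = np.array([[1., 2., 5., 6.],
                 [3., 4., 7., 8.],
                 [0., 1., 2., 3.],
                 [1., 2., 3., 4.]])

print(pool2d(feat, mode="max"))  # [[4. 8.] [2. 4.]]
print(pool2d(feat, mode="avg"))  # [[2.5 6.5] [1.  3. ]]
```

Max pooling preserves the most salient response in each window while average pooling smooths over it, which is presumably why the model feeds its capsule layers from both branches.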