Abstract

Speech emotion recognition (SER) is an important part of speech recognition, and convolutional neural networks (CNNs) have shown excellent performance in SER. However, traditional CNN models usually use only the last fully connected layer as the model output; this neglects time-frequency information and limits how deeply emotional information is mined. Moreover, each convolution attends only to its local receptive field, so a plain CNN cannot capture global features. To better exploit speech features in a CNN, a CNN model with multiple pooling strategies is proposed. The Mel-spectrogram is the input to the model, from which the CNN learns emotional details. In addition, three global average pooling (GAP) layers are connected to different pooling layers to reduce the dimensions of the corresponding tensors. An Average Pyramid Pooling (APP) layer then fuses all features, and finally a softmax classifier distinguishes the different emotions. Experiments show that the improved CNN model achieves better recognition performance than the traditional model.
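The architecture outlined above (per-stage global average pooling followed by feature fusion and a softmax classifier) can be sketched as follows. This is a minimal illustration, not the paper's exact configuration: the channel sizes, the three conv/max-pool blocks, and the fusion-by-concatenation rule standing in for the APP layer are all assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class MultiPoolCNN(nn.Module):
    """Illustrative multi-pooling CNN for speech emotion recognition.

    Assumptions (not from the abstract): three conv blocks, channel
    widths 16/32/64, and concatenation of the three GAP vectors as a
    stand-in for the Average Pyramid Pooling (APP) fusion step.
    """

    def __init__(self, n_classes: int = 4):
        super().__init__()
        # Three conv blocks, each ending in a max-pooling layer.
        self.block1 = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.block2 = nn.Sequential(
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.block3 = nn.Sequential(
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # One global-average-pooling layer applied to each pooling stage.
        self.gap = nn.AdaptiveAvgPool2d(1)
        # Classifier over the fused (concatenated) GAP features.
        self.fc = nn.Linear(16 + 32 + 64, n_classes)

    def forward(self, x):
        # x: (batch, 1, n_mels, time) Mel-spectrogram.
        f1 = self.block1(x)
        f2 = self.block2(f1)
        f3 = self.block3(f2)
        # GAP each stage's output down to a fixed-length vector.
        g = [self.gap(f).flatten(1) for f in (f1, f2, f3)]
        fused = torch.cat(g, dim=1)  # feature fusion across stages
        return torch.softmax(self.fc(fused), dim=1)

model = MultiPoolCNN(n_classes=4)
mel = torch.randn(2, 1, 64, 128)   # batch of 2 dummy spectrograms
probs = model(mel)
print(probs.shape)                  # torch.Size([2, 4])
```

Pooling intermediate stages (rather than only the final layer) is what lets the classifier see both fine time-frequency detail from early layers and more abstract features from later ones.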
