Abstract

Speech emotion recognition, an important auxiliary component of speech interaction technology, has long been a research hotspot. In this work, we propose a novel framework for speech emotion recognition based on deep neural networks. The proposed framework is composed of two main modules: a local feature extractor that uses deep recurrent layers to extract frame-level feature representations, and a global feature integration module that learns utterance-level representations for emotion recognition. Two architectures, a multi-granularity convolutional layer and a multi-scale attentive layer, are constructed for the feature integration module. Furthermore, we adopt two data augmentation approaches, noise injection and vocal tract length perturbation, both of which improve the performance and robustness of the models and reduce the influence of individual variations. The proposed models achieve recognition accuracies of 92.08% and 90.41% on the Emo-DB and CASIA datasets, respectively. In addition, ablation experiments are conducted to show the effectiveness of the proposed feature integration module and data augmentation approaches.
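The abstract names noise injection as one of the two augmentation approaches but does not specify its parameters. As a generic illustration only (the SNR level, waveform, and function name below are assumptions, not the authors' settings), noise injection typically adds white Gaussian noise scaled to a target signal-to-noise ratio:

```python
import numpy as np

def inject_noise(signal, snr_db, rng=None):
    """Add white Gaussian noise at a target signal-to-noise ratio (dB).

    A hypothetical helper for illustration; the paper's actual
    augmentation settings are not given in the abstract.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(signal ** 2)
    # Solve SNR = 10 * log10(P_signal / P_noise) for the noise power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Example: augment a synthetic 1-second, 16 kHz waveform at 20 dB SNR.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
clean = 0.5 * np.sin(2 * np.pi * 440 * t)  # 440 Hz test tone
noisy = inject_noise(clean, snr_db=20, rng=np.random.default_rng(0))
```

Vocal tract length perturbation, the second approach, instead warps the frequency axis of each utterance to simulate speakers with different vocal tract lengths, which is why both methods help reduce the influence of individual variations.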
