Abstract

Speech emotion recognition is a challenging task in speech processing with broad application prospects. Because the choice of speech features directly affects recognition accuracy, identifying effective features is a key issue in speech emotion recognition. Emotional expression is often entangled with speaker-specific characteristics, which makes it difficult to find speech features that generalize across speakers; addressing this difficulty is one of the main contributions of this paper. We construct a general emotional feature representation of the speech signal from both a local and a global perspective: (1) a speech emotion recognition model is built from spectrograms using a Convolutional Recurrent Neural Network (CRNN), which learns the spatial structure of the emotional information and captures the salient local features; (2) starting from Low-Level acoustic Descriptors (LLDs), extensive experiments are used to screen a limited-dimension feature set covering energy, fundamental frequency, and spectral descriptors, together with statistical functionals computed over them, to obtain a global feature description; (3) the two kinds of features are combined, and the performance of the various features is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus, verifying the accuracy and representativeness of the features obtained in this paper.
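To make step (1) concrete, the following is a minimal sketch of a CRNN operating on log-mel spectrograms, assuming a four-class emotion setup as is typical for IEMOCAP; the layer sizes, input shape, and PyTorch framework are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Illustrative CRNN for speech emotion recognition from spectrograms.

    Input:  (batch, 1, n_mels, n_frames) log-mel spectrogram
    Output: (batch, n_classes) emotion logits
    Layer sizes are assumptions, not the paper's exact configuration.
    """
    def __init__(self, n_mels=64, n_classes=4, hidden=128):
        super().__init__()
        # Convolutional front-end learns local time-frequency (spatial) patterns
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),          # halve both frequency and time axes
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        # Recurrent layer aggregates the convolutional features over time
        self.rnn = nn.GRU(input_size=64 * (n_mels // 4), hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        f = self.conv(x)                                 # (B, 64, n_mels/4, T/4)
        b, c, m, t = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, t, c * m)   # time-major sequence
        out, _ = self.rnn(f)
        return self.fc(out[:, -1, :])                    # last time step -> logits


if __name__ == "__main__":
    spec = torch.randn(8, 1, 64, 200)    # batch of dummy log-mel spectrograms
    logits = CRNN()(spec)
    print(logits.shape)                  # torch.Size([8, 4])
```

The convolutional stack plays the role of the local feature extractor over the spectrogram, while the recurrent layer summarizes those features across time; in the paper this local representation is complemented by the global LLD-based statistical features of step (2).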
