Abstract
Speech emotion recognition is a challenging and widely studied topic in speech processing. Existing models often achieve limited accuracy on speech emotion recognition tasks and generalize poorly. Because the feature set and the model design directly affect recognition accuracy, research on both features and models is important. Since emotional expression is correlated with the global features, local features, and model design of speech, a universal solution for effective speech emotion recognition is difficult to find. Accordingly, the main purpose of this paper is to generate general emotion features from speech signals from different angles and to use an ensemble learning model for the emotion recognition task. The work is divided into the following aspects: (1) Three expert roles for speech emotion recognition are designed. Expert 1 focuses on three-dimensional feature extraction from local signals; expert 2 focuses on extracting comprehensive information from local data; and expert 3 emphasizes global features: acoustic low-level descriptors (LLDs), high-level statistics functionals (HSFs), and local features with their temporal relationships. A single- or multi-level deep learning model matching each expert's characteristics is designed, drawing on the convolutional neural network (CNN), bi-directional long short-term memory (BLSTM), and gated recurrent unit (GRU); a convolutional recurrent neural network (CRNN) combined with an attention mechanism is used for the internal training of the experts. (2) An ensemble learning model is designed so that each expert can play to its own strengths and evaluate speech emotion from a different focus. (3) Experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus compare the emotion recognition performance of the individual experts and the ensemble learning model, verifying the validity of the proposed model.
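To make the division of labor concrete, the sketch below shows one plausible realization of the three experts in PyTorch. All layer sizes, input dimensions, and the attention-pooling details are illustrative assumptions (the abstract does not specify exact configurations), and the four-class label set is only the common IEMOCAP evaluation subset.

import torch
import torch.nn as nn

NUM_EMOTIONS = 4  # assumption: the common angry/happy/neutral/sad IEMOCAP subset

class CNNExpert(nn.Module):
    """Expert 1 (sketch): CNN over a three-channel 'image' of local features,
    e.g., a log-mel spectrogram with its delta and delta-delta channels."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, NUM_EMOTIONS),
        )

    def forward(self, x):  # x: (batch, 3, mel_bins, frames)
        return self.net(x)

class BLSTMExpert(nn.Module):
    """Expert 2 (sketch): BLSTM over frame-level LLD sequences, with attention
    pooling so emotionally salient frames receive larger weights."""
    def __init__(self, n_lld=40, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(n_lld, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, NUM_EMOTIONS)

    def forward(self, x):  # x: (batch, frames, n_lld)
        h, _ = self.rnn(x)
        w = torch.softmax(self.attn(h), dim=1)  # per-frame attention weights
        return self.out((w * h).sum(dim=1))     # attention-weighted pooling

class GRUExpert(nn.Module):
    """Expert 3 (sketch): GRU over segment-level HSF vectors, modeling global
    statistics and their ordering across the utterance."""
    def __init__(self, n_hsf=88, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_hsf, hidden, batch_first=True)
        self.out = nn.Linear(hidden, NUM_EMOTIONS)

    def forward(self, x):  # x: (batch, segments, n_hsf)
        _, h = self.rnn(x)
        return self.out(h[-1])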
Highlights
As the most convenient and natural medium of human communication, speech is the most basic and direct way we have to transmit information to each other.
Focusing on the above problems, this paper carries out research on the design of speech emotion features with a multi-level deep learning model and constructs ensemble learning schemes that comprehensively consider multiple experts' suggestions [3].
In [16], a deep retinal convolutional neural network is proposed for speech emotion recognition (SER), with advanced features learned from the spectrogram, and it surpasses previous studies in emotion recognition accuracy.
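For context, a typical spectrogram front end for such CNN-based SER systems can be computed with librosa as below; the sampling rate, FFT size, hop length, and mel-band count are common defaults, not parameters taken from [16].

import librosa
import numpy as np

def log_mel_spectrogram(wav_path, sr=16000, n_mels=40):
    """Load an utterance and return an (n_mels, frames) log-mel spectrogram."""
    y, _ = librosa.load(wav_path, sr=sr)          # resample to a fixed rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=512, hop_length=160, n_mels=n_mels
    )
    return librosa.power_to_db(mel, ref=np.max)   # log compression in dB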
Summary
As the most convenient and natural medium of human communication, speech is the most basic and direct way we have to transmit information to each other. The decision-making stage of a speech emotion recognition model often plays a decisive role; if the state of an expert is unstable at that point, it directly affects the final emotional judgment. Against this research background, some scholars are working to overcome these problems and improve the speech emotion recognition rate, but few have fully explored the correlation between global and local features across different roles, features, and models. Focusing on the above problems, this paper carries out research on the design of speech emotion features with a multi-level deep learning model and constructs ensemble learning schemes that comprehensively consider multiple experts' suggestions [3]. The fifth part summarizes the work of this paper and discusses prospects for future work.
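As one concrete reading of "comprehensive consideration of multiple experts' suggestions", the sketch below fuses the experts' class posteriors by weighted soft voting. The paper's actual fusion rule is not given in this summary, and the uniform weights are an assumption.

import torch

def ensemble_predict(experts, inputs, weights=None):
    """Fuse expert opinions by (weighted) averaging of softmax posteriors.
    experts: list of trained models; inputs: matching list of input tensors."""
    probs = [torch.softmax(m(x), dim=-1) for m, x in zip(experts, inputs)]
    if weights is None:                    # assumption: equally trusted experts
        weights = [1.0 / len(probs)] * len(probs)
    fused = sum(w * p for w, p in zip(weights, probs))
    return fused.argmax(dim=-1)            # predicted emotion per utterance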