Abstract

Speech emotion recognition is a challenging and widely studied topic in speech processing. Existing models often achieve limited accuracy on speech emotion recognition tasks and generalize poorly. Because the feature set and the model design directly affect recognition accuracy, research on both features and models is important. Moreover, because emotional expression correlates with the global features, local features, and model design of speech, it is difficult to find a universal solution for effective speech emotion recognition. Accordingly, the main purpose of this paper is to derive general emotion features from speech signals from different perspectives and to use an ensemble learning model for the emotion recognition task. The work comprises the following aspects: (1) Three expert roles for speech emotion recognition are designed. Expert 1 focuses on three-dimensional feature extraction from local signals; expert 2 focuses on extracting comprehensive information from local data; and expert 3 emphasizes global features: acoustic low-level descriptors (LLDs), high-level statistics functionals (HSFs), and local features together with their temporal relationships. A single- or multi-level deep learning model matching each expert's characteristics is designed, built from convolutional neural networks (CNNs), bi-directional long short-term memory (BLSTM), and gated recurrent units (GRUs); a convolutional recurrent neural network (CRNN) combined with an attention mechanism is used for training within the experts. (2) An ensemble learning model is designed so that each expert can play to its own strengths and evaluate speech emotion from a different focus. (3) Experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus compare the performance of the individual experts and the ensemble learning model and verify the validity of the proposed model.
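As a concrete but non-authoritative illustration of the three-expert design summarized above, the sketch below shows how such expert networks might be instantiated in PyTorch. All class names, layer sizes, input shapes, and the four-class IEMOCAP label set are assumptions made for illustration, not the paper's exact configuration.

```python
# Hedged sketch of the three expert models described in the abstract.
# Layer sizes and input shapes are illustrative assumptions; the paper's
# exact configurations (filter counts, hidden sizes, feature dimensions) may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 4  # e.g. angry / happy / neutral / sad on IEMOCAP (assumption)


class Expert1CNN(nn.Module):
    """Expert 1: dual-channel CNN over a 3-D local feature map
    (e.g. static / delta / delta-delta log-Mel channels)."""
    def __init__(self, n_classes=NUM_EMOTIONS):
        super().__init__()
        self.branch_a = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.branch_b = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2))
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):                       # x: (batch, 3, mels, frames)
        a = self.branch_a(x).mean(dim=(2, 3))   # global average pooling per branch
        b = self.branch_b(x).mean(dim=(2, 3))
        return self.head(torch.cat([a, b], dim=1))


class Expert2GRU(nn.Module):
    """Expert 2: GRU over local comprehensive (frame-level) features,
    with additive attention pooling over time."""
    def __init__(self, feat_dim=40, hidden=64, n_classes=NUM_EMOTIONS):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                        # x: (batch, frames, feat_dim)
        h, _ = self.gru(x)                       # (batch, frames, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)   # attention weights over frames
        return self.head((w * h).sum(dim=1))


class Expert3CRNN(nn.Module):
    """Expert 3: CRNN over LLD sequences, combined with utterance-level HSFs."""
    def __init__(self, lld_dim=32, hsf_dim=88, hidden=64, n_classes=NUM_EMOTIONS):
        super().__init__()
        self.conv = nn.Conv1d(lld_dim, 64, kernel_size=5, padding=2)
        self.rnn = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden + hsf_dim, n_classes)

    def forward(self, lld, hsf):                 # lld: (batch, frames, lld_dim)
        c = F.relu(self.conv(lld.transpose(1, 2))).transpose(1, 2)
        h, _ = self.rnn(c)                       # BLSTM over convolved LLDs
        return self.head(torch.cat([h.mean(dim=1), hsf], dim=1))
```

Each expert outputs class logits, which the ensemble model described in (2) can then combine.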

Highlights

  • As the most convenient and natural medium for human communication, speech is the most basic and direct way we have to transmit information to each other

  • Focusing on the above problems, this paper studies the design of speech emotion features with multi-level deep learning models and constructs ensemble learning schemes that comprehensively consider multiple experts' suggestions [3]

  • In [16], a deep retinal convolutional neural network is proposed for Speech Emotion Recognition (SER), with advanced features learned from a spectrogram, achieving higher emotion recognition accuracy than previous studies
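For context only, the snippet below gives a generic illustration of learning SER features from a log-Mel spectrogram with a small CNN; it is not the specific "deep retinal" architecture of [16]. The librosa parameters, file name, and tiny network are illustrative assumptions.

```python
# Generic illustration of spectrogram-based feature learning for SER
# (not the specific deep retinal CNN of [16]); all parameters are assumptions.
import librosa
import numpy as np
import torch
import torch.nn as nn

def log_mel_spectrogram(path, sr=16000, n_mels=64):
    """Load an utterance and return a log-Mel spectrogram (n_mels x frames)."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

# A minimal CNN mapping the spectrogram to emotion logits (4 classes assumed).
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 4))

spec = log_mel_spectrogram("utterance.wav")            # hypothetical file name
logits = cnn(torch.tensor(spec)[None, None].float())   # (1, 4) emotion scores
```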

Summary

Introduction

As the most convenient and natural medium for human communication, speech is the most basic and direct way we have to transmit information to each other. The decision-making stage of a speech emotion recognition model often plays a decisive role; at this point, if the state of an expert is unstable, it directly affects the final emotional judgment. Based on the above research status, some scholars are working to overcome these problems and improve the recognition rate of speech emotion, but few studies have fully explored the correlation between global and local features across different roles, features, and models. Focusing on the above problems, this paper studies the design of speech emotion features with multi-level deep learning models and constructs ensemble learning schemes that comprehensively consider multiple experts' suggestions [3]. The fifth part summarizes the work of this paper and outlines prospects for future work.
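The decision-making stage mentioned above combines the experts' outputs into a final judgment. A minimal sketch of one common fusion strategy, weighted soft voting over the experts' class posteriors, is shown below; the equal weights and the soft-voting rule are assumptions for illustration, not necessarily the paper's exact ensemble scheme.

```python
# Hedged illustration of combining expert predictions by weighted soft voting;
# the paper's actual ensemble rule may differ (this is an assumption).
import torch

def ensemble_decision(expert_logits, weights=None):
    """expert_logits: list of (batch, n_classes) tensors, one per expert.
    Returns the predicted class per utterance after weighted averaging of
    the experts' softmax posteriors."""
    if weights is None:
        weights = [1.0 / len(expert_logits)] * len(expert_logits)
    probs = [w * torch.softmax(l, dim=-1) for w, l in zip(weights, expert_logits)]
    return torch.stack(probs).sum(dim=0).argmax(dim=-1)
```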

Related Work
Voiceprint Recognition Technology
Multi-Level Recognition Technology
Ensemble Learning Technology
Design Route for the Overall Model
Expert 1
Analysis and Preprocessing of Speech Signals
Design of Double-Channel Model Based on CNN
Expert 2
Local Comprehensive Feature Extraction
Design of GRU Model Combined with the Attention Mechanism
Expert 3
Feature Selection and Integration
Design of Feature Extraction Model Based on CRNN
Design of Multilevel Model Based on HSFs and CRNN
Design
Design of Ensemble Learning Model
Experimental Preparation
Independent Experiment for Each Expert
Experiment with Ensemble Learning Model
Findings
Conclusions