Abstract

Speech signals carry abundant information about personal emotions, which plays an important part in conveying latent human characteristics and expressions. However, the scarcity of emotional speech data hinders the development of speech emotion recognition (SER) and limits improvements in recognition accuracy. Currently, the most effective approach is to use unsupervised feature learning to extract speech features from available speech data and to train emotion classifiers on these features. In this paper, we apply autoencoders, specifically a denoising autoencoder (DAE) and an adversarial autoencoder (AAE), to extract features from LibriSpeech for model pre-training, and then conduct classification experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. To address the imbalanced data distribution in IEMOCAP, we developed a novel data augmentation approach that optimizes the overlap shift between consecutive segments, and we redesigned the data partition. The best classification accuracy reached 78.67% weighted accuracy (WA) and 76.89% unweighted accuracy (UA) with the AAE. Compared with the best results known to us (76.18% WA and 76.36% UA, obtained with a supervised learning method), this is a slight improvement. These results suggest that unsupervised learning benefits the development of SER and offers a new way to mitigate the problem of data scarcity.

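The overlap-shift augmentation mentioned above can be pictured with a short sketch: a fixed-length window slides over each utterance, and the shift between consecutive windows is a tunable fraction of the window length, so a larger overlap yields more training segments for under-represented classes. This is a minimal illustration under assumed parameters; the function name segment_with_overlap, the 3-second window, and the per-class overlap values are hypothetical and not the exact settings used in the paper.

```python
import numpy as np

def segment_with_overlap(signal, sr, win_sec=3.0, overlap=0.5):
    """Split a 1-D waveform into fixed-length segments.

    Consecutive segments start hop = (1 - overlap) * window samples apart,
    so a larger overlap produces more (partially redundant) segments per
    utterance -- a simple way to oversample under-represented classes.
    """
    win = int(win_sec * sr)                   # samples per segment
    hop = max(1, int(win * (1.0 - overlap)))  # shift between segment starts
    if len(signal) < win:                     # pad short utterances to one window
        signal = np.pad(signal, (0, win - len(signal)))
    starts = range(0, len(signal) - win + 1, hop)
    return np.stack([signal[s:s + win] for s in starts])

# Hypothetical usage: a minority class (e.g. "happy") gets a larger overlap
# than a majority class (e.g. "neutral") to rebalance segment counts.
happy_segments = segment_with_overlap(np.random.randn(16000 * 5), 16000, overlap=0.75)
neutral_segments = segment_with_overlap(np.random.randn(16000 * 5), 16000, overlap=0.50)
```
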
Highlights

  • Language is the most fundamental mode of human emotional expression, and speech is the primary way humans communicate

  • We explored the practicability of applying unsupervised learning to speech feature extraction and implemented speech emotion recognition on the learned features

  • We proposed adapting multiple autoencoders for feature extraction together with a convolutional neural network for classification, and analyzed how this combination influences speech emotion recognition results (an illustrative sketch follows this list)

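The third highlight's pairing of an autoencoder feature extractor with a CNN classifier can be sketched roughly as follows. This is an illustrative PyTorch example only: a plain autoencoder stands in for the DAE/AAE variants, and the layer sizes, the 40x128 log-mel input, the 256-dimensional latent code, and the four emotion classes are assumptions rather than the architecture reported in the paper.

```python
import torch
import torch.nn as nn

class SpectrogramAE(nn.Module):
    """Toy autoencoder over flattened log-mel spectrogram patches."""
    def __init__(self, n_mels=40, frames=128, latent=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_mels * frames, 1024), nn.ReLU(),
            nn.Linear(1024, latent), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent, 1024), nn.ReLU(),
            nn.Linear(1024, n_mels * frames),
        )

    def forward(self, x):                 # x: (batch, 1, n_mels, frames)
        z = self.encoder(x)               # latent features for the classifier
        recon = self.decoder(z)           # reconstruction of the flattened input
        return recon, z

class EmotionCNN(nn.Module):
    """Small CNN classifier trained on the (frozen or fine-tuned) latent codes."""
    def __init__(self, latent=256, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Unflatten(1, (1, 16, 16)),                     # 256 -> 1x16x16 map
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, n_classes),
        )

    def forward(self, z):
        return self.net(z)

# Pre-train SpectrogramAE on unlabeled speech features (e.g. from LibriSpeech),
# then train EmotionCNN on the encoder outputs of labeled IEMOCAP segments.
ae, clf = SpectrogramAE(), EmotionCNN()
logits = clf(ae(torch.randn(8, 1, 40, 128))[1])   # (8, 4) emotion logits
```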

Introduction

Language is the most fundamental mode of human emotional expression, and speech is the primary way humans communicate. Emotional states shape human interaction and are conveyed through facial expressions [3,4], body posture [5], communication content [6], and speech mannerisms [7]. Most emotion recognition tasks are accomplished with hand-crafted features, which requires guarantees of data validity and quantity. To overcome the lack of data and make feature extraction more objective, it is worthwhile to develop unsupervised learning methods for emotion recognition. We attempt to design a robust and generic mechanism that recognizes emotions accurately using acoustic features extracted from widely used public speech data by unsupervised autoencoder-based learning approaches.
