Abstract

As a major component of speech signal processing, speech emotion recognition has become increasingly essential to understanding human communication. Benefiting from deep learning, many researchers have proposed various unsupervised models to extract effective emotional features and supervised models to train emotion recognition systems. In this paper, we utilize semi-supervised ladder networks for speech emotion recognition. The model is trained by minimizing the supervised loss together with an auxiliary unsupervised cost function. The unsupervised auxiliary task yields discriminative representations of the input features and also acts as a regularizer for the supervised emotion task. We also compare the ladder network with other classical autoencoder structures. The experiments were conducted on the interactive emotional dyadic motion capture (IEMOCAP) database, and the results reveal that the proposed method achieves superior performance with only a small amount of labelled data and outperforms the other methods.
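The combined objective described above can be illustrated with a minimal NumPy sketch: a supervised cross-entropy term computed only on the labelled examples, plus an unsupervised denoising (reconstruction) cost computed on all examples. This is a toy one-hidden-layer stand-in, not the paper's implementation; the weight names (`W`, `U`, `V`), the single denoising layer, and the weighting factor `lam` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ladder_loss(x, y, labelled, params, noise_std=0.3, lam=1.0):
    """Toy ladder-style objective: supervised cross-entropy on the labelled
    subset plus an unsupervised denoising cost on every example."""
    W, U, V = params["W"], params["U"], params["V"]  # encoder, classifier, decoder

    # Corrupted encoder path: Gaussian noise injected at the input.
    h_noisy = np.tanh((x + noise_std * rng.standard_normal(x.shape)) @ W)
    x_hat = h_noisy @ V                      # decoder denoises back to the clean input
    recon_cost = np.mean((x_hat - x) ** 2)   # unsupervised auxiliary cost

    # Clean encoder path: classification, using labelled examples only.
    probs = softmax(np.tanh(x[labelled] @ W) @ U)
    sup_cost = -np.mean(np.log(probs[np.arange(labelled.sum()), y[labelled]] + 1e-12))

    return sup_cost + lam * recon_cost

# 8 utterance-level feature vectors; only 3 of them carry emotion labels.
x = rng.standard_normal((8, 10))
y = rng.integers(0, 4, size=8)               # 4 hypothetical emotion classes
labelled = np.array([True, True, True] + [False] * 5)
params = {"W": rng.standard_normal((10, 6)) * 0.1,
          "U": rng.standard_normal((6, 4)) * 0.1,
          "V": rng.standard_normal((6, 10)) * 0.1}
loss = ladder_loss(x, y, labelled, params)
```

Because the reconstruction term is computed on all examples while the cross-entropy term uses only the labelled subset, the unlabelled data still shape the learned representation, which is the core of the semi-supervised setup.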

Highlights

  • As one of the main information mediums in human communication, speech carries basic linguistic information as well as a wealth of emotional information; emotion can help people understand real expressions and potential intentions

  • The skip connections between the encoder and decoder ease the pressure of transporting the information needed to reconstruct the representations through the top layers

  • We apply semi-supervised learning to speech emotion recognition to explore the effect of the ladder network
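The skip connections mentioned in the highlights can be sketched as follows: each decoder layer combines the top-down signal with a lateral copy of the matching corrupted encoder activation, so fine reconstruction detail need not travel through the top of the network. This is a toy NumPy illustration under assumed shapes and a simple averaging combinator, not the ladder network's actual (learned) combinator function.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(z, 0.0)

def noise(a, std=0.3):
    return a + std * rng.standard_normal(a.shape)

# Two-layer corrupted encoder over toy 8-dim features.
x = rng.standard_normal((4, 8))
W1 = rng.standard_normal((8, 6)) * 0.1
W2 = rng.standard_normal((6, 3)) * 0.1
z1 = noise(x @ W1)            # corrupted layer-1 pre-activation
z2 = noise(relu(z1) @ W2)     # corrupted layer-2 pre-activation

def combinator(z_lateral, u_topdown):
    # Illustrative fixed combinator; the real ladder network learns this.
    return 0.5 * (z_lateral + u_topdown)

# Decoder with lateral skip connections from each encoder layer.
V2 = rng.standard_normal((3, 6)) * 0.1
V1 = rng.standard_normal((6, 8)) * 0.1
z2_hat = combinator(z2, z2)          # top decoder layer
z1_hat = combinator(z1, z2_hat @ V2) # skip connection carries z1 across
x_hat = z1_hat @ V1                  # reconstructed input
```

The lateral path (`z1` fed directly into `z1_hat`) is what relieves the top layers from carrying all the detail required for reconstruction, leaving them free to learn the discriminative, emotion-relevant abstractions.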



Introduction

Emotion can help people understand real expressions and potential intentions. Speech emotion recognition has many applications in human-computer interaction, since it can help machines understand emotional states as human beings do [1]. In call centers, it can be used to monitor customers' emotional states, which reflect service quality; this information can help improve service levels and reduce the workload of manual evaluation [2]. Emotion is conventionally represented as several discrete human emotional moods, such as happiness, sadness, and anger, over utterances [3]. Speech emotion databases are built on the premise that every utterance is assigned to exactly one emotional category, so most researchers regard speech emotion recognition as a typical supervised learning task.

