Abstract

The goal of a human interface is to recognize the user’s emotional state precisely. In speech emotion recognition research, the most important issues are extracting suitable speech features and pairing them with an appropriate classification engine. Well-defined speech databases are also needed to accurately recognize and analyze emotions from speech signals. In this work, we constructed a Korean emotional speech database for speech emotion analysis and proposed a feature combination that improves emotion recognition performance with a recurrent neural network model. To investigate acoustic features that reflect distinct momentary changes in emotional expression, we extracted F0, Mel-frequency cepstrum coefficients, spectral features, harmonic features, and others. Statistical analysis was performed to select an optimal combination of acoustic features that affect emotion in speech. We used a recurrent neural network model to classify emotions from speech. The results show that the proposed system performs more accurately than those of previous studies.
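
As a hedged illustration only (not the authors' exact pipeline), the sketch below shows how the kinds of frame-level acoustic features named above, F0, Mel-frequency cepstrum coefficients, and spectral features, could be extracted with the librosa library; the file path, sampling rate, and particular feature choices are assumptions for the example.

    # Illustrative feature extraction with librosa; parameters are assumptions.
    import numpy as np
    import librosa

    def extract_features(wav_path, sr=16000):
        # Load the speech signal (assumed 16 kHz sampling rate).
        y, sr = librosa.load(wav_path, sr=sr)

        # F0 contour via the pYIN estimator; unvoiced frames come back as NaN.
        f0, voiced_flag, voiced_prob = librosa.pyin(
            y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'), sr=sr)

        # 13 Mel-frequency cepstral coefficients per frame.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

        # Example spectral features: centroid and roll-off per frame.
        centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
        rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)

        # Stack frame-level features into a (frames, dims) matrix for an RNN input.
        n = min(mfcc.shape[1], centroid.shape[1], rolloff.shape[1], len(f0))
        feats = np.vstack([
            np.nan_to_num(f0[:n])[np.newaxis, :],
            mfcc[:, :n],
            centroid[:, :n],
            rolloff[:, :n],
        ]).T
        return feats  # shape: (frames, 16)

Harmonic features mentioned in the abstract would be appended to the same frame-level matrix in an analogous way.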

Highlights

  • With technological development in the information society, high-performance personal computers are rapidly becoming widespread

  • The long short-term memory (LSTM) block comprises memory cells with a cyclic structure and input, forget, and output gates. LSTM blocks differ from conventional Recurrent Neural Network (RNN) models in that their gates pass or block information as appropriate to the circumstances, controlling the flow through the memory cell (a minimal sketch of one LSTM step follows this list)

  • When emotion recognition was performed by combining the basic feature combination with harmonic features, the accuracy was 75.46%, an improvement of around 5% over the same combination without harmonic features
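
To make the gate mechanism described above concrete, here is a minimal NumPy sketch of a single LSTM time step; the weights, dimensions, and stacked-parameter layout are illustrative assumptions, not the paper's implementation.

    # One LSTM step: input, forget, and output gates control the memory cell.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, c_prev, W, U, b):
        # W, U, b hold the stacked parameters for the four gate computations.
        z = W @ x_t + U @ h_prev + b
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
        g = np.tanh(g)                                 # candidate cell update
        c_t = f * c_prev + i * g                       # gated memory-cell update
        h_t = o * np.tanh(c_t)                         # gated hidden output
        return h_t, c_t

    # Toy dimensions: a 16-dim feature frame, 8 hidden units, random placeholder weights.
    d_in, d_h = 16, 8
    rng = np.random.default_rng(0)
    W = rng.standard_normal((4 * d_h, d_in))
    U = rng.standard_normal((4 * d_h, d_h))
    b = np.zeros(4 * d_h)
    h, c = np.zeros(d_h), np.zeros(d_h)
    h, c = lstm_step(rng.standard_normal(d_in), h, c, W, U, b)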


Summary

Introduction

With technological development in the information society, high-performance personal computers are rapidly becoming widespread. A speech signal is one of the most natural ways of human communication. It contains linguistic content as well as implicit paralinguistic information, including the speaker’s emotions. We constructed a Korean emotional speech database (K-EmoDB) for speech emotion analysis and proposed a feature combination that improves emotion recognition performance using a Recurrent Neural Network (RNN) model [22]. Based on the evaluation results, 150 emotion data points in each category were chosen to construct the final Korean emotional speech database. Five female and five male actors recorded speech data according to specified emotions, producing 10 different sentences for seven kinds of emotion: anger, boredom, disgust, fear, happiness, sadness, and neutral. Each actor’s recorded production was available in three modality formats: audio-visual (AV), video only, and audio only.
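
As a purely hypothetical illustration of the corpus layout described above (the actual K-EmoDB file organization is not specified here), the snippet below enumerates the actor, sentence, emotion, and modality combinations; all identifiers are placeholders.

    # Hypothetical enumeration of corpus items: 10 actors (5 female, 5 male),
    # 10 sentences, 7 emotions, and 3 modality formats per recording.
    from itertools import product

    EMOTIONS = ["anger", "boredom", "disgust", "fear", "happiness", "sadness", "neutral"]
    MODALITIES = ["audio-visual", "video-only", "audio-only"]
    ACTORS = [f"F{i:02d}" for i in range(1, 6)] + [f"M{i:02d}" for i in range(1, 6)]
    SENTENCES = list(range(1, 11))

    recordings = [
        {"actor": a, "sentence": s, "emotion": e, "modality": m}
        for a, s, e, m in product(ACTORS, SENTENCES, EMOTIONS, MODALITIES)
    ]
    print(len(recordings))  # 10 * 10 * 7 * 3 = 2100 candidate items before evaluation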

Speech Emotion Recognition
Feature Selection
Emotion Recognition Model
Experiments and Results
K-EmoDB
International DB
Conclusions