Abstract

We propose a speech-emotion recognition (SER) model with an “attention-Long Short-Term Memory (LSTM)-attention” component that combines IS09, a feature set commonly used for SER, with the mel spectrogram, and we analyze the reliability problem of the interactive emotional dyadic motion capture (IEMOCAP) database. The model’s attention mechanism focuses on the emotion-related elements of the IS09 and mel spectrogram features and on the emotion-related time intervals within them; in this way, the model extracts emotion information from a given speech signal. In the baseline study, the proposed model achieved a weighted accuracy (WA) of 68% on the improvised subset of IEMOCAP. However, in the main study, neither the proposed model nor its modified variants exceeded a WA of 68% on the improvised subset. We attribute this to the limited reliability of the IEMOCAP dataset: a more reliable dataset is required for a more accurate evaluation of the model’s performance. Therefore, in this study, we reconstructed a more reliable dataset based on the labeling results provided by IEMOCAP. On this more reliable dataset, the model achieved a WA of 73%.
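
The abstract does not give the layer-level details of the architecture, so the following PyTorch code is only a minimal sketch of what an “attention-LSTM-attention” block might look like: feature-wise attention reweights each frame’s features (e.g., mel bins or IS09 descriptors), an LSTM models temporal context, and temporal attention pools the emotion-relevant frames. The layer sizes, the sigmoid/softmax attention forms, and the fusion strategy are assumptions, not the authors’ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLSTMAttention(nn.Module):
    """Sketch of an attention-LSTM-attention block for SER (assumed structure)."""

    def __init__(self, n_feats=128, hidden=128, n_classes=4):
        super().__init__()
        self.feat_attn = nn.Linear(n_feats, n_feats)   # feature-wise attention scores
        self.lstm = nn.LSTM(n_feats, hidden, batch_first=True)
        self.time_attn = nn.Linear(hidden, 1)          # temporal attention scores
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, x):                              # x: (batch, time, n_feats)
        # 1) attend over the feature axis (e.g. mel bins or IS09 LLDs)
        x = x * torch.sigmoid(self.feat_attn(x))
        # 2) model temporal context with an LSTM
        h, _ = self.lstm(x)                            # (batch, time, hidden)
        # 3) attend over time and pool the emotion-relevant frames
        w = F.softmax(self.time_attn(h), dim=1)        # (batch, time, 1)
        return self.classifier((h * w).sum(dim=1))     # (batch, n_classes)

# usage: logits = AttentionLSTMAttention()(torch.randn(8, 300, 128))
```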

Highlights

  • The emotional state of a person influences their modes of interaction, such as facial expressions, speech characteristics, and the content of communication

  • More reliable samples can be seen as easier samples, which probably lie farther from the emotion-class boundaries in the feature space

  • We attempted to improve the performance of the speech-emotion recognition (SER) model by combining the IS09 features, which are widely used in SER, with the mel spectrogram, a low-level descriptor (LLD), and using them as inputs (a sketch of this feature extraction follows this list)
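
As a concrete illustration of the two inputs, the sketch below extracts a log-mel spectrogram with librosa and calls the openSMILE command-line tool for the IS09 (INTERSPEECH 2009 Emotion Challenge) feature set. The config-file name/path and the output format are assumptions that depend on the local openSMILE installation; this is not the authors’ extraction code.

```python
import subprocess
import librosa

def mel_spectrogram(wav_path, sr=16000, n_mels=128):
    """Log-mel spectrogram with shape (time, n_mels)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel).T

def is09_features(wav_path, out_csv="is09.csv",
                  config="IS09_emotion.conf"):  # config path depends on the openSMILE install (assumption)
    """Extract the IS09 feature set with the openSMILE command-line tool."""
    subprocess.run(["SMILExtract", "-C", config, "-I", wav_path, "-O", out_csv],
                   check=True)
    return out_csv
```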

Introduction

The emotional state of a person influences their modes of interaction, such as facial expressions, speech characteristics, and the content of communication. Since speech is one of the main modes of expression, a human–machine interface must recognize, understand, and respond to the emotional stimuli contained in human speech. Emotions affect both vocal and verbal content. We aim to develop a mechanism that can recognize emotions from the acoustic features of utterances [1]. Several studies on speech-emotion recognition have aimed to identify features that enable the discrimination of emotions [2,3]. The most common approach to emotion recognition is to extract a large number of statistical features from an utterance, reduce their dimensionality with a dimension-reduction technique, and classify them with machine-learning algorithms [5,6,7] (a minimal example of this pipeline is sketched below).
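
The following scikit-learn sketch illustrates that conventional pipeline; the choice of utterance-level statistics (mean/std of MFCCs), the number of PCA components, and the SVM classifier are illustrative assumptions, not a specific method from the cited works.

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def utterance_statistics(wav_path, sr=16000):
    """Utterance-level statistics (here: mean/std of 13 MFCCs) standing in
    for a large hand-crafted statistical feature vector."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)            # (13, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # (26,)

# X: (n_utterances, n_features) feature matrix, y: emotion labels
clf = make_pipeline(StandardScaler(), PCA(n_components=20), SVC(kernel="rbf"))
# clf.fit(X_train, y_train); clf.predict(X_test)
```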
